ABSTRACT

A microprocessor for a host computer designed to execute 
target application programs for a target computer having a 
target instruction set including the combination of code 
morphing software, and morph host processing hardware 
designed to execute instructions of a host instruction set, the 
combination of the code morphing software and the morph 
host processing hardware comprising means to translate a 
set of target instructions into instructions of a host instruc- 
tion set, means to optimize the instructions of the host 
instruction set translated from the target application program 
speculating upon the occurrence of a condition, means to 
determine under control of the code morphing software 
official state of the target computer which existed at the 
beginning of a translation of a set of target instructions 
during execution of the target application program by the 
microprocessor, means for updating state of the target com- 
puter from state of the host computer when a set of host 
instructions executes in accordance with the speculation, 
means to detect failure of the condition during the execution 
of the set of host instructions, means for updating state of the 
host computer from state of the target computer when a set 
of host instructions fails to execute in accordance with the 
speculation, and means to translate a new set of host 
instructions without the speculation when a set of host 
instructions fails to execute in accordance with the specu- 
lation. 

28 Claims, 6 Drawing Sheets 
COMBINING HARDWARE AND SOFTWARE
TO PROVIDE AN IMPROVED 
MICROPROCESSOR 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to computer systems and, more 
particularly, to methods and apparatus for providing an 
improved microprocessor. 

2. History of the Prior Art 

There are thousands of application programs which run on 
computers designed around particular families of micropro- 
cessors. The largest number of programs in existence are 
designed to run on computers (generally referred to as "IBM 
Compatible Personal Computers") using the "X86" family 
of microprocessors (including the Intel® 8088, Intel 8086, 
Intel 80186, Intel 80286, i386, i486, and progressing 
through the various Pentium® microprocessors) designed 
and manufactured by Intel Corporation of Santa Clara, Calif. 
There are many other examples of programs designed to run 
on computers using other families of processors. Because 
there are so many application programs which run on these 
computers, there is a large market for microprocessors 
capable of use in such computers, especially computers 
designed to process X86 programs. The microprocessor 
market is not only large but also quite lucrative. 

Although the market for microprocessors which are able 
to run large numbers of application programs is large and 
lucrative, it is quite difficult to design a new competitive 
microprocessor. For example, even though the X86 family 
of processors has been in existence for a number of years 
and these processors are included in the majority of com- 
puters sold and used, there are few successful competitive 
microprocessors which are able to run X86 programs. The 
reasons for this are many. 

In order to be successful, a microprocessor must be able 
to run all of the programs (including operating systems and 
legacy programs) designed for that family of processors as 
fast as existing processors without costing more than exist- 
ing processors. In addition, to be economically successful, a 
new microprocessor must do at least one of these things 
better than existing processors to give buyers a reason to 
choose the new processor over existing proven processors. 

It is difficult and expensive to make a microprocessor run 
as fast as state of the art microprocessors. Processors carry 
out instructions through primitive operations such as 
loading, shifting, adding, storing, and similar low level 
operations and respond only to such primitive instructions in 
executing any instruction furnished by an application program.
For example, a processor designed to run the instructions
of a complex instruction set computer (CISC) such
as an X86, in which instructions may designate the process to
be carried out at a relatively high level, has historically
included read only memory (ROM) which stores so-called
micro-instructions. Each micro-instruction includes a
sequence of primitive instructions which when run in suc- 
cession bring about the result commanded by the high level 
CISC instruction. Typically, an "add A to B" CISC instruc- 
tion is decoded to cause a look up of an address in ROM at 
which a micro-instruction for carrying out the functions of 
the "add A to B" instruction is stored. The micro-instruction 
is loaded, and its primitive instructions are run in sequence 
to cause the "add A to B" instruction to be carried out. With 
such a CISC computer, the primitive operations within a 
micro-instruction can never be changed during program 
execution. Each CISC instruction can only be run by decoding
the instruction, addressing and fetching the micro-instruction,
and running the sequence of primitive operations
in the order provided in the micro-instruction. Each
time the micro-instruction is run, the same sequence must be
followed.
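
The fixed look-up described above can be pictured with a minimal sketch. The table contents, the primitive names, and the function `run_cisc_instruction` are hypothetical and chosen only for illustration; the text does not specify how any particular processor encodes its micro-instructions.

```python
# Hypothetical micro-instruction ROM: each CISC opcode maps to a fixed
# sequence of primitive operations that cannot change at run time.
MICROCODE_ROM = {
    "ADD_A_TO_B": ("load A", "load B", "add", "store B"),
}


def run_cisc_instruction(opcode, execute_primitive):
    """Decode the opcode, fetch its micro-instruction, run primitives in order."""
    primitives = MICROCODE_ROM[opcode]   # look-up of the ROM entry
    for primitive in primitives:         # same order on every execution
        execute_primitive(primitive)
    return primitives


# Running the same instruction twice repeats the identical sequence.
trace = []
run_cisc_instruction("ADD_A_TO_B", trace.append)
run_cisc_instruction("ADD_A_TO_B", trace.append)
```

In this model every execution pays again for the decode, the look-up, and the full primitive sequence, which is the cost the passage attributes to microcoded CISC processors.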

State of the art processors for running X86 applications 
utilize a number of techniques to provide the fastest pro- 
cessing possible at a price which is still economically 
reasonable. Any new processor which implements known 
hardware techniques for accelerating the speed at which a
processor may run must increase the sophistication of the 
processing hardware. This requires increasing the cost of the 
hardware. 

For example, a superscalar microprocessor which uses a
plurality of processing channels in order to execute two or
more operations at once has a number of additional requirements.
At the most basic level, a simple superscalar microprocessor
might decode each application instruction into the
micro-instructions which carry out the function of the application
instruction. Then, the simple superscalar microprocessor
schedules two micro-instructions to run together if
the two micro-instructions do not require the same hardware
resources and the execution of a micro-instruction does not
depend on the results of other micro-instructions being
processed.
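
The two pairing conditions named above (no shared hardware resource, no dependence on a result still being produced) can be sketched as follows. The `MicroOp` record, its field names, and the register names are assumptions made for illustration; the sketch checks only the two conditions the text names and ignores other hazards a real design would consider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    unit: str           # hardware resource the operation occupies (assumed model)
    reads: frozenset    # registers read
    writes: frozenset   # registers written


def can_issue_together(first, second):
    """True if `second` may run in the same cycle as the earlier `first`."""
    same_resource = first.unit == second.unit
    depends = bool(first.writes & second.reads)  # second needs first's result
    return not same_resource and not depends


load = MicroOp("load r1", "mem", frozenset(), frozenset({"r1"}))
add = MicroOp("add r2,r1", "alu", frozenset({"r1", "r2"}), frozenset({"r2"}))
shift = MicroOp("shl r3", "shifter", frozenset({"r3"}), frozenset({"r3"}))
```

Here `load` and `shift` could be paired, while `load` and `add` could not, because `add` reads the register that `load` is still producing.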

A more advanced superscalar microprocessor typically
decodes each application instruction into a series of primitive
instructions so that those primitive instructions may be
reordered and scheduled into the most efficient execution
order. This requires that each individual primitive operation
be addressed and fetched. To accomplish reordering, the
processor must be able to ensure that a primitive instruction
which requires data resulting from another primitive instruction
is run after that other primitive instruction produces the
needed data. Such a superscalar microprocessor must assure
that two primitive instructions being run together do not
both require the same hardware resources. Such a processor
must also resolve conditional branches before the effects of
branch operations can be completed.
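
A minimal sketch of the reordering constraint described above, assuming a simple greedy scheduler over a short list of primitive operations. The tuple layout, the unit names, and the function `schedule` are hypothetical; branch resolution, which the text also requires, is not modeled here.

```python
def conflicts(earlier, later):
    """True if `later` must wait for `earlier` (register dependence)."""
    _, _, e_reads, e_writes = earlier
    _, _, l_reads, l_writes = later
    return bool(e_writes & (l_reads | l_writes)) or bool(e_reads & l_writes)


def schedule(ops, width=2):
    """Greedily group ops (name, unit, reads, writes) into issue cycles."""
    pending = list(ops)              # kept in program order
    cycles = []
    while pending:
        issued = []                  # indices into `pending` issued this cycle
        for i, op in enumerate(pending):
            if len(issued) == width:
                break
            # An op waits for every earlier op it depends on, including
            # earlier ops that are issued in this same cycle.
            blocked = any(conflicts(prev, op) for prev in pending[:i])
            unit_busy = any(op[1] == pending[j][1] for j in issued)
            if not blocked and not unit_busy:
                issued.append(i)
        cycles.append([pending[i][0] for i in issued])
        pending = [op for i, op in enumerate(pending) if i not in issued]
    return cycles


program = [
    ("load r1", "mem", set(), {"r1"}),
    ("add r2,r1", "alu", {"r1", "r2"}, {"r2"}),
    ("load r3", "mem", set(), {"r3"}),
    ("shl r4", "shifter", {"r4"}, {"r4"}),
]
plan = schedule(program)
```

With these inputs the shift is hoisted ahead of the add, because the add must wait for the first load while the shift depends on nothing, and the second load is held back one cycle because the memory unit is already in use.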

Thus, superscalar microprocessors require extensive
hardware to compare the relationships of the primitive
instructions to one another and to reorder and schedule the
sequence of the primitive instructions to carry out any
instruction. As the number of processing channels increases,
the amount and cost of the hardware to accomplish these
superscalar acceleration techniques increases approximately
quadratically. All of these hardware requirements increase
the complexity and cost of the circuitry involved. As in
dealing with micro-instructions, each time an application
instruction is executed, a superscalar microprocessor must
use its relatively complicated addressing and fetching hardware
to fetch each of these primitive instructions, must
reorder and reschedule these primitive instructions based on
the other primitive instructions and hardware usage, and
then must execute all of the rescheduled primitive instructions.
The need to run each application instruction through
the entire hardware sequence each time it is executed limits
the speed at which a superscalar processor is capable of
executing its instructions.
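
One way to see why the cost might grow roughly quadratically is to assume, as a simplification not stated in the text, that the hardware compares every candidate primitive instruction against every other candidate. The function name and the channel counts below are illustrative only.

```python
def pairwise_checks(channels):
    """Number of distinct instruction pairs among `channels` candidates."""
    return channels * (channels - 1) // 2


# Doubling the number of channels roughly quadruples the comparisons.
growth = {n: pairwise_checks(n) for n in (2, 4, 8)}
```

Under this assumption two channels need one comparison, four need six, and eight need twenty-eight.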

Moreover, even though these various hardware techniques
increase the speed of processing, the complexity
involved in providing such hardware significantly increases
the cost of such a microprocessor. For example, the Intel
i486 DX4 processor uses approximately 1.5 million transistors.
Adding the hardware required to accomplish the checking
of dependencies and scheduling necessary to process
instructions through two channels in a basic superscalar
microprocessor such as the Intel Pentium® requires the use
of more than three million transistors. Adding the hardware
to allow reordering among primitive instructions derived
from different target instructions, provide speculative
execution, allow register renaming, and provide branch
prediction increases the number of transistors to over six
million in the Intel Pentium Pro™ microprocessor. Thus, it
can be seen that each hardware addition to increase operation
speed has drastically increased the number of transistors
in the latest state of the art microprocessors.

Even using these known techniques may not produce a
microprocessor faster than existing microprocessors because
manufacturers use most of the economically feasible techniques
known to accelerate the operation of existing microprocessors.
Consequently, designing a faster processor is a
very difficult and expensive task.

Reducing the cost of a processor is also very difficult. As
illustrated above, hardware acceleration techniques which
produce a sufficiently capable processor are very expensive.
One designing a new processor must obtain the facilities to
produce the hardware. Such facilities are very difficult to
obtain because chip manufacturers do not typically spend
assets on small runs of devices. The capital investment
required to produce a chip manufacturing facility is so great
that it is beyond the reach of most companies.

Even though one is able to design a new processor which
runs all of the application programs designed for a family of
processors at least as fast as competitive processors, the
price of competitive processors includes sufficient profit that
substantial price reductions are sure to be faced by any
competitor.

Although designing a competitive processor by increasing
the complexity of the hardware is very difficult, another way
to run application programs (target application programs)
designed for a particular family of microprocessors (target
microprocessors) has been to emulate the target microprocessor
in software on another faster microprocessor (host
microprocessor). This is an incrementally inexpensive
method of running these programs because it requires only
the addition of some form of emulation software which
enables the application program to run on a faster microprocessor.
The emulator software changes the target instructions
of an application program written for the target processor
family into host instructions capable of execution by
the host microprocessor. These changed instructions are then
run under control of the operating system on the faster host
microprocessor.

There have been a number of different designs by which
target applications may be run on host computers with faster
processors than the processors of target computers. In
general, the host computers executing target programs using
emulation software utilize reduced instruction set (RISC)
microprocessors because RISC processors are theoretically
simpler and consequently can run faster than other types of
processors.

However, even though RISC computer systems running
emulator software are often capable of running X86 (or
other) programs, they usually do so at a rate which is
substantially slower than the rate at which state of the art
X86 computer systems run the same programs. Moreover,
often these emulator programs are not able to run all or a
large number of the target programs available.

The reasons why emulator programs are not able to run
target programs as rapidly as the target microprocessors are
quite complicated and require some understanding of the
different emulation operations. FIG. 1 includes a series of
diagrams representing the different ways in which a plurality
of different types of microprocessors execute target
application programs.

In FIG. 1(a), a typical CISC microprocessor such as an
Intel X86 microprocessor is shown running a target application
program which is designed to be run on that target
processor. As may be seen, the application is run on the
CISC processor using a CISC operating system (such as MS
DOS, Windows 3.1, Windows NT, and OS/2 which are used
with X86 computers) designed to provide interfaces by
which access to the hardware of the computer may be
gained. Typically, the instructions of the application program
are selected to utilize the devices of the computer only
through the access provided by the operating system. Thus,
the operating system handles the manipulations which allow
applications access to memory and to the various input/output
devices of the computer. The target computer
includes memory and hardware which the operating system
recognizes, and a call to the operating system from a target
application causes an operating system device driver to
cause an expected operation to occur with a defined device
of the target computer. The instructions of the application
execute on the processor where they are changed into
operations (embodied in microcode or the more primitive
operations from which microcode is assembled) which the
processor is capable of executing. As has been described
above, each time a complicated target instruction is
executed, the instruction calls the same subroutine stored as
microcode (or as the same set of primitive operations). The
same subroutine is always executed. If the processor is a
superscalar, these primitive operations for carrying out a
target instruction can often be reordered by the processor,
rescheduled, and executed using the various processing
channels in the manner described above; however, the
subroutine is still fetched and executed.

In FIG. 1(b), a typical RISC microprocessor such as a
PowerPC microprocessor used in an Apple Macintosh computer
is represented running the same target application
program which is designed to be run on the CISC processor
of FIG. 1(a). As may be seen, the target application is run on
the host processor using at least a partial target operating
system to respond to a portion of the calls which the target
application generates. Typically these are calls to the
application-like portions of the target operating system used
to provide graphical interfaces on the display and short
utility programs which are generally application-like. The
target application and these portions of the target operating
system are changed by a software emulator such as Soft
PC® which breaks the instructions furnished by the target
application program and the application-like target operating
system programs into instructions which the host processor
and its host operating system are capable of executing.
The host operating system provides the interfaces
through which access to the memory and input/output hardware
of the RISC computer may be gained.

However, the host RISC processor and the hardware
devices associated with it in a host RISC computer are
usually quite different than are the devices associated with
the processor for which the target application was designed;
and the various instructions provided by the target application
program are designed to cooperate with the device
drivers of the target operating system in accessing the
various portions of the target computer. Consequently, the
emulation program, which changes the instructions of the
target application program to primitive host instructions
which the host operating system is capable of utilizing, must
somehow link the operations designed to operate hardware
devices in the target computer to operations which hardware
devices of the host system are capable of implementing.
Often this requires the emulator software to create virtual
devices which respond to the instructions of the target
application to carry out operations which the host system is
incapable of carrying out because the target devices are not
those of the host computer. Sometimes the emulator is
required to create links from these virtual devices through
the host operating system to host hardware devices which
are present but are addressed in a different manner by the
host operating system.

Target programs when executed in this manner run relatively
slowly for a number of reasons. First, each target
instruction from a target application program and from the
target operating system must be changed by the emulator
into the host primitive functions used by the host processor.
If the target application is designed for a CISC machine such
as an X86, the target instructions are of varying lengths and
quite complicated so that changing them to host primitive
instructions is quite involved. The original target instructions
are first decoded, and the sequence of primitive host
instructions which make up the target instructions are determined.
Then the address (or addresses) of each sequence of
primitive host instructions is determined, each sequence of
the primitive host instructions is fetched, and these primitive
host instructions are executed in or out of order. The large
number of extra steps required by an emulator to change the
target application and operating system instructions into host
instructions understood by the host processor must be conducted
each time an instruction is executed and slows the
process of emulation.

Second, many target instructions include references to
operations conducted by particular hardware devices which
function in a particular manner in the target computer,
hardware which is not available in the host computer. To
carry out the operation, the emulation software must either
make software connections to the hardware devices of the
host computer through the existing host operating system or
the emulator software must furnish a virtual hardware
device. Emulating the hardware of another computer in
software is very difficult. The emulation software must
generate virtual devices for each of the target application
calls to the host operating system; and each of these virtual
devices must provide calls to the actual host devices. Emulating
a hardware device requires that when a target instruction
is to use the device, the code representing the virtual
device required by that instruction be fetched from memory
and run to implement the device. Either of these methods of
solving the problem adds another series of operations to the
execution of the sequence of instructions.

Complicating the problem of emulation is the requirement
that the target application take various exceptions which are
carried out by hardware of the target computer and the target
operating system in order for the computer system to operate.
When a target exception is taken during the operation of
a target computer, state of the computer at the time of the
exception must be saved typically by calling a microcode
sequence to accomplish the operation, the correct exception
handler must be retrieved, the exception must be handled,
then the correct point in the program must be found for
continuing with the program. Sometimes this requires that
the program revert to the state of the target computer at the
point the exception was taken, and at other times a branch
provided by the exception handler is taken. In any case, the
hardware and software of the target computer required to
accomplish these operations must somehow be provided in
the process of emulation. Because the correct target state
must be available at the time of any such exception for
proper execution, the emulator is forced to keep accurate
track of this state at all times so that it is able to correctly
respond to these exceptions. In the prior art, this has required
executing each instruction in the order provided by the target
application because only in this way could correct target
state be maintained.

Moreover, prior art emulators have always been required
to maintain the order of execution of the target application
for other reasons. Target instructions can be of two types,
ones which affect memory or ones which affect a memory
mapped input/output (I/O) device. There is no way to know
without attempting to execute an instruction whether an
operation is to affect memory or a memory-mapped I/O
device. When instructions operate on memory, optimizing
and reordering is possible and greatly aids in speeding the
operation of a system. However, operations affecting I/O
devices often must be practiced in the precise order in which
those operations are programmed without the elimination of
any steps or they may have some adverse effect on the
operation of the I/O device. For example, a particular I/O
operation may have the effect of clearing an I/O register. If
the operations take place out of order so that a register is
cleared of a value which is still necessary, then the result of
the operation may be different than the operation commanded
by the target instruction. Without a means to distinguish
memory from memory mapped I/O, it is necessary
to treat all instructions as though they affect memory
mapped I/O. This severely restricts the nature of optimizations
that are achievable. Because prior art emulators lack
both means to detect the nature of the memory being
addressed and means to recover from such failures, they are
required to proceed sequentially through the target instructions
as though each operation affects memory mapped I/O.
This greatly limits the possibility of optimizing the host
instructions.

Another problem which limits the ability of prior art
emulators to optimize the host code is caused by self-modifying
code. If a target instruction has been changed to
a sequence of host instructions which in turn write back to
change the original target instruction, then the host instructions
are no longer valid. Consequently, the emulator must
constantly check to determine whether a store is to the target
code area. All of these problems make this type of emulation
much slower than running a target application on a target
processor.

Another example of the type of emulation software shown
in FIG. 1(b) is described in an article entitled, "Talisman:
Fast and Accurate Multicomputer Simulation," R. C.
Bedichek, Laboratory for Computer Sciences, Massachusetts
Institute of Technology. This is a more complete
example of translation in that it can emulate a complete
research system and run the research target operating system.
Talisman uses a host UNIX operating system.

In FIG. 1(c), another example of emulation is shown. In
this case, a PowerPC microprocessor used in an Apple
Macintosh computer is represented running a target application
program which was designed to be run on the
Motorola 68000 family CISC processors used in the original
Macintosh computers; this type of arrangement has been
required in order to allow Apple legacy programs to run on
the Macintosh computers with RISC processors. As may be
seen, the target application is run on the host processor using
at least a partial target operating system to respond to the
application-like portions of the target operating system. A
software emulator breaks the instructions furnished by the
target application program and the application-like target
operating system programs into instructions which the host
processor and its host operating system are capable of
executing. The host operating system provides the interfaces
through which access to the memory and input/output hardware
of the host computer may be gained.

Again, the host RISC processor and the devices associated
with it in the host RISC computer are quite different
than are the devices associated with the Motorola CISC
processor; and the various target instructions are designed to
cooperate with the target CISC operating system in accessing
the various portions of the target computer.
Consequently, the emulation program must link the operations
designed to operate hardware devices in the target
computer to operations which hardware devices of the host
system are capable of implementing. This requires the
emulator to create software virtual devices which respond to
the instructions of the target application and to create links
from these virtual devices through the host operating system
to host hardware devices which are present but are addressed
in a different manner by the host operating system.

The target software run in this manner runs relatively
slowly for the same reasons that the emulation of FIG. 1(b)
runs slowly. First, each target instruction from the target
application and from the target operating system must be
changed by fetching the instruction; and all of the host
primitive functions derived from that instruction must be run
in sequence each time the instruction is executed. Second,
the emulation software must generate virtual devices for
each of the target application calls to the host operating
system; and each of these virtual devices must provide calls
to the actual host devices. Third, the emulator must treat all
instructions as conservatively as it treats instructions which
are directed to memory mapped I/O devices or risk generating
exceptions from which it cannot recover. Finally, the
emulator must maintain the correct target state at all times
and store operations must always check ahead to determine
whether a store is to the target code area. All of these
requirements eliminate the ability of the emulator to practice
significant optimization of the code run on the host processor
and make this type of emulation much slower than
running the target application on a target processor. Emulation
rates less than one-quarter as fast as state of the art
processors are considered very good. In general, this has
relegated this type of emulation software to uses where the
capability of running applications designed for another
processor is useful but not primary.

In FIG. 1(d), a particular method of emulating a target
application program on a host processor which provides
relatively good performance for a very limited series of
target applications is illustrated. The target application furnishes
instructions to an emulator which changes those
instructions into instructions for the host processor and the
host operating system. The host processor is a Digital
Equipment Corporation Alpha RISC processor, and the host
operating system is Microsoft NT. The only target applications
which may be run by this system are 32 bit applications
designed to be executed by a target X86 processor with a
Windows WIN32S compliant operating system. Since the
host and target operating systems are almost identical, being
designed to handle these same instructions, the emulator
software may change the instructions very easily. Moreover,
the host operating system is already designed to respond to
the same calls that the target application generates so that the
generation of virtual devices is considerably reduced.

Although this is technically an emulation system running
a target application on a host processor, it is a very special
case. Here the emulation software is running on a host
operating system already designed to run similar applications.
This allows the calls from the target applications to be
more simply directed to the correct facilities of the host and
the host operating system. More importantly, this system
will run only 32 bit Windows applications which probably
amount to less than one percent of all X86 applications.
Moreover, this system will run applications on only one
operating system, Windows NT; while X86 processors run
applications designed for a large number of operating systems.
Such a system, therefore, could be considered not to
be compatible within the terms expressed earlier in this
specification. Thus, a processor running such an emulator
cannot be considered to be a competitive X86 processor.

Another method of emulation by which software may be
used to run portions of applications written for a first
instruction set on a computer which recognizes a different
instruction set is illustrated in FIG. 1(e). This form of
emulation software is typically utilized by a programmer
who may be porting an application from one computer
system to another. Typically, the target application is being
designed for some target computer other than the host
machine on which the emulator is being run. The emulator
software analyzes the target instructions, translates those
instructions into instructions which may be run on the host
machine, and caches those host instructions so that they may
be reused. This dynamic translation and caching allows
portions of applications to be run very rapidly. This form of
emulator is normally used with software tracing tools to
provide detailed information about the behavior of a target
program being run. The output of a tracing tool may, in turn,
be used to drive an analyzer program which analyzes the
trace information.

In order to determine how the code actually functions, an
emulator of this type, among other things, runs with the host
operating system on the host machine, furnishes the virtual
hardware which the host operating system does not provide,
and otherwise maps the operations of the computer for
which the application was designed to the hardware
resources of the host machine in order to carry out the
operations of the program being run. This software virtualizing
of hardware and mapping to the host computer can be
very slow and incomplete.

Moreover, because it often requires a plurality of host
instructions to carry out one of the target instructions,
exceptions including faults and traps which require a target
operating system exception handler may be generated and
cause the host to cease processing the host instructions at a
point unrelated to target instruction boundaries. When this
happens, it may be impossible to handle the exception
correctly because the state of the host processor and memory
is incorrect. If this is the case, the emulator must be stopped
and rerun to trace the operations which generated the
exception. Thus, even though such an emulator may run
sequences of target code very rapidly, it has no method for
recovering from these exceptions and so cannot run any
significant portion of an application rapidly.

This is not a particular problem with this form of emulator
because the functions being performed by the emulators,
tracers, and the associated analyzers are directed to generating
new programs or porting old programs to another
machine so that the speed at which the emulator software
runs is rarely at issue. That is, a programmer is usually not
interested in how fast the code produced by an emulator runs
on the host machine but in whether the emulator produces
code which is executable on the machine for which it is
designed and which will run rapidly on that machine.
Consequently, this type of emulation software does not
provide a method for running application programs written 
in a first instruction set to run on a different type of 
microprocessor for other than programming purposes. An 
example of this type of emulation software is described in an
article entitled, "Shade: A Fast Instruction-Set Simulator for 
Execution Profiling," Cmelik and Keppel. 

It is desirable to provide competitive microprocessors 
which are faster and less expensive than state of the art 
microprocessors yet are entirely compatible with target
application programs designed for state of the art micropro- 
cessors running any operating systems available for those 
microprocessors. 

SUMMARY OF THE INVENTION

It is, therefore, an object of the present invention to 
provide a microprocessor which is less expensive than 
conventional state of the art microprocessors yet is compat- 
ible with and capable of running application programs and 
operating systems designed for other microprocessors at a
faster rate than those other microprocessors. 

This and other objects of the present invention are real- 
ized by a microprocessor for a host computer designed to 
execute target programs for a target computer having a target
instruction set comprising the combination of software, and 
enhanced host processing hardware designed to execute 
instructions of a host instruction set, the combination of the 
software and the enhanced host processing hardware
comprising means to translate a set of target instructions into
instructions of a host instruction set. 

In a preferred embodiment, the combination of the soft- 
ware and the enhanced host processing hardware includes 
means to optimize the instructions of the host instruction set 
translated from the target program speculating upon the
occurrence of a condition, means to determine under control 
of the software official state of the target computer which 
existed at the beginning of a translation of a set of target 
instructions during execution of the target program by the 
microprocessor, means for updating state of the target
computer from state of the host computer when a set of host
instructions executes in accordance with the speculation, 
means to detect failure of the condition during the execution 
of the set of host instructions, means for updating state of the 
host computer from state of the target computer when a set
of host instructions fails to execute in accordance with the 
speculation, and means to translate a new set of host 
instructions without the speculation when a set of host 
instructions fails to execute in accordance with the
speculation.
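
The sequence recited above (speculate, commit on success, restore
official state and retranslate without the speculation on failure)
may be pictured with a minimal sketch. The sketch is illustrative
only and is not taken from this specification: the names IO_BASE,
SpeculationFailed, speculative_translation, conservative_translation,
and execute are hypothetical, a dictionary stands in for the state of
the target computer, and the speculated condition is assumed to be
that an address refers to ordinary memory rather than memory mapped
I/O.

```python
import copy

IO_BASE = 0xF000  # assumed start of a memory-mapped I/O region (illustrative)


class SpeculationFailed(Exception):
    """Signals that the condition a translation speculated upon did not hold."""


def speculative_translation(working, address):
    # Hypothetical optimized host code: it assumes `address` is ordinary
    # memory and updates working state before the assumption is checked.
    working["loads"] += 1
    if address >= IO_BASE:
        raise SpeculationFailed(hex(address))


def conservative_translation(working, address):
    # Hypothetical host code translated again without the speculation.
    if address >= IO_BASE:
        working["io_accesses"] += 1
    else:
        working["loads"] += 1


def execute(official, address):
    """Run one translation against a working copy of the official state."""
    working = copy.deepcopy(official)
    try:
        speculative_translation(working, address)
    except SpeculationFailed:
        working = copy.deepcopy(official)           # roll back to official state
        conservative_translation(working, address)  # retry without speculation
    official.clear()
    official.update(working)                        # commit working state
    return official
```

In this sketch the working copy is discarded whenever the
speculation fails, so the partial update made by the speculative
code never reaches the official state.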

In a preferred embodiment, the enhanced processing 
hardware of the morph host includes a very long instruction 
word (VLIW) processor designed to allow a number of 
instructions of the target instruction set to be translated, 
optimized, reordered, and rescheduled into very long
instruction words which may be cached for later reuse 
thereby increasing the speed of execution by eliminating the 
need for each of these steps each time the target instruction 
is encountered. This allows the primitive instructions of the 
host instruction set which have been generated from a
number of instructions of the target instruction set to be 
reordered and scheduled together to effect advanced super- 
scalar operations and other time-saving operations which 
eliminate unnecessary hardware operations and accelerate 
overall operation.
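
The benefit of caching translated instruction words for later reuse
may be illustrated by a minimal sketch. It is an assumption made for
illustration, not the implementation of this specification: the class
TranslationBuffer, its lookup method, and the stand-in translator
toy_translate are hypothetical names, and a Python dictionary keyed
by target address stands in for the cache.

```python
class TranslationBuffer:
    """Maps a target address to a previously generated host translation."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical translator callback
        self._cache = {}
        self.translations_done = 0    # counts how often translation actually ran

    def lookup(self, target_address):
        translation = self._cache.get(target_address)
        if translation is None:
            # First encounter: translate once and keep the result.
            translation = self._translate(target_address)
            self._cache[target_address] = translation
            self.translations_done += 1
        return translation


def toy_translate(target_address):
    # Stand-in for decoding, optimizing, reordering, and scheduling;
    # returns a callable that represents the host translation.
    return lambda: f"host code for target address {target_address:#x}"
```

A second lookup of the same target address returns the stored
translation without invoking the translator again, which is the
saving the paragraph above describes.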

These and other objects and features of the invention will 
be better understood by reference to the detailed description 
which follows taken together with the drawings in which
like elements are referred to by like designations throughout 
the several views. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGS. 1(a)-(e) are diagrams illustrating the manner of
operation of microprocessors designed in accordance with 
the prior art. 

FIG. 2 is a block diagram of a microprocessor designed in 
accordance with the present invention running an applica- 
tion designed for a different microprocessor. 

FIG. 3 is a diagram illustrating a portion of the micro- 
processor shown in FIG. 2. 

FIG. 4 is a block diagram illustrating a register file used 
in a microprocessor designed in accordance with the present 
invention. 

FIG. 5 is a block diagram illustrating a gated store buffer 
designed in accordance with the present invention. 

FIGS. 6(a)-(c) illustrate instructions used in various 
microprocessors of the prior art and in a microprocessor 
designed in accordance with the present invention. 

FIG. 7 illustrates a method practiced by a software portion 
of a microprocessor designed in accordance with the present 
invention. 

FIG. 8 illustrates another method practiced by a software 
portion of a microprocessor designed in accordance with the 
present invention.

FIG. 9 is a block diagram illustrating an improved com- 
puter system including the present invention. 

FIG. 10 is a block diagram illustrating a portion of the 
microprocessor shown in FIG. 2. 

NOTATION AND NOMENCLATURE 

Some portions of the detailed descriptions which follow 
are presented in terms of symbolic representations of opera- 
tions on data bits within a computer memory. These descrip- 
tions and representations are the means used by those skilled 
in the data processing arts to most effectively convey the 
substance of their work to others skilled in the art. The 
operations are those requiring physical manipulations of 
physical quantities. Usually, though not necessarily, these 
quantities take the form of electrical or magnetic signals 
capable of being stored, transferred, combined, compared, 
and otherwise manipulated. It has proven convenient at 
times, principally for reasons of common usage, to refer to 
these signals as bits, values, elements, symbols, characters, 
terms, numbers, or the like. It should be borne in mind, 
however, that all of these and similar terms are to be 
associated with the appropriate physical quantities and are 
merely convenient labels applied to these quantities. 

Further, the manipulations performed are often referred to 
in terms, such as adding or comparing, which are commonly 
associated with mental operations performed by a human 
operator. No such capability of a human operator is neces- 
sary or desirable in most cases in any of the operations 
described herein which form part of the present invention; 
the operations are machine operations. Useful machines for 
performing the operations of the present invention include 
general purpose digital computers or other similar devices. 
In all cases the distinction between the method operations in 
operating a computer and the method of computation itself 
should be borne in mind. The present invention relates to a 
method and apparatus for operating a computer in process- 
ing electrical or other (e.g. mechanical, chemical) physical 
signals to generate other desired physical signals. 




During the following description, in some cases the target
program is referred to as a program which is designed to be
executed on an X86 microprocessor in order to provide
exemplary details of operation because the majority of
emulators run X86 applications. However, the target program
may be one designed to run on any family of target
computers. This includes target virtual computers, such as
Pcode machines, Postscript machines, or Java virtual
machines.

DETAILED DESCRIPTION

The present invention overcomes the problems of the
prior art and provides a microprocessor which is faster than
microprocessors of the prior art, is capable of running all of
the software for all of the operating systems which may be
run by a large number of families of prior art
microprocessors, yet is less expensive than prior art
microprocessors.

Rather than using a microprocessor with more complicated
hardware to accelerate its operation, the present invention
combines an enhanced hardware processing portion
(referred to as a "morph host" in this specification) which is
much simpler than state of the art microprocessors and an
emulating software portion (referred to as "code morphing
software" in this specification) in a manner that the two
portions function together as a microprocessor with more
capabilities than any known competitive microprocessor.

More particularly, a morph host is a processor which
includes hardware enhancements to assist in having state of
a target computer immediately at hand when an exception or
error occurs, while code morphing software is software
which translates the instructions of a target program to
morph host instructions for the morph host and responds to
exceptions and errors by replacing working state with correct
target state when necessary so that correct retranslations
occur. Code morphing software may also include various
processes for enhancing the speed of processing. Rather than
providing hardware to enhance the speed of processing as do
all of the very fast prior art microprocessors, the present
invention allows a large number of acceleration enhancement
techniques to be carried out in selectable stages by the
code morphing software. Providing the speed enhancement
techniques in the code morphing software allows the morph
host to be implemented using much less complicated hardware
which is faster and substantially less expensive than
the hardware of prior art microprocessors. As a comparison,
one embodiment of the present invention designed to run all
available X86 applications is implemented by a morph host
including approximately one-quarter of the number of gates
of the Pentium Pro microprocessor yet runs X86 applications
substantially faster than does the Pentium Pro microprocessor
or any other known microprocessor capable of
processing these applications.

The code morphing software utilizes certain techniques
which have previously been used only by programmers
designing new software or emulating new hardware. The
morph host includes hardware enhancements especially
adapted to allow the acceleration techniques provided by the
code morphing software to be utilized efficiently. These
hardware enhancements allow the code morphing software
to implement acceleration techniques over a broader range
of instructions. These hardware enhancements also permit
additional acceleration techniques to be practiced by the
code morphing software which are unavailable in hardware
processors and could not be implemented in those processors
except at exorbitant cost. These techniques significantly
increase the speed of the microprocessor of the present
invention compared to the speeds of prior art microprocessors
practicing the execution of native instruction sets.

For example, the code morphing software combined with
the enhanced morph host allows the use of techniques which
allow the reordering and rescheduling of primitive instructions
generated by a sequence of target instructions without
requiring the addition of significant circuitry. By allowing
the reordering and rescheduling of a number of target
instructions together, other optimization techniques can be
used to reduce the number of processor steps which are
necessary to carry out a group of target instructions to fewer
than those required by any other microprocessors which will
run the target applications.

The code morphing software combined with the enhanced
morph host translates target instructions into instructions for
the morph host on the fly and caches those host instructions
in a memory data structure (referred to in this specification
as a "translation buffer"). The use of a translation buffer to
hold translated instructions allows instructions to be recalled
without rerunning the lengthy process of determining which
primitive instructions are required to implement each target
instruction, addressing each primitive instruction, fetching
each primitive instruction, optimizing the sequence of
primitive instructions, allocating assets to each primitive
instruction, reordering the primitive instructions, and
executing each step of each sequence of primitive instructions
involved each time each target instruction is executed.
Once a target instruction has been translated, it may be
recalled from the translation buffer and executed without the
need for any of these myriad of steps.

A primary problem of prior art emulation techniques has
been the inability of these techniques to handle with good
performance exceptions generated during the execution of a
target program. This is especially true of exceptions generated
in running the target application which are directed to
the target operating system where the correct target state
must be available at the time of any such exception for
proper execution of the exception and the instructions which
follow. Consequently, the emulator is forced to keep accurate
track of the target state at all times and must constantly
check to determine whether a store is to the target code area.

Other exceptions create similar problems. For example,
exceptions can be generated by the emulator to detect
particular target operations which have been replaced by
some particular host function. In particular, various hardware
operations of a target processor may be replaced by
software operations provided by the emulator software.
Additionally, the host processor executing the host instructions
derived from the target instructions can also generate
exceptions. All of these exceptions can occur either during
the attempt to change target instructions into host instructions
by the emulator, or when the host translations are
executed on the host processor. An efficient emulation must
provide some manner of recovering from these exceptions
efficiently and in a manner that the exception may be
correctly handled. None of the prior art does this for all
software which might be emulated.

In order to overcome these limitations of the prior art, the
present invention incorporates a number of hardware
improvements in its enhanced morph host. These improvements
include a gated store buffer and a large plurality of
additional processor registers. Some of the additional
registers allow the use of register renaming to lessen the
problem of instructions needing the same hardware
resources. The additional registers also allow the maintenance
of a set of host or working registers for processing the
host instructions and a set of target registers to hold the
official state of the target processor for which the target
application was created. The target (or shadow) registers are
connected to their working register equivalents through a
dedicated interface that allows an operation called "commit"
to quickly transfer the content of all working registers to
official target registers and allows an operation called
"rollback" to quickly transfer the content of all official target
registers back to their working register equivalents. The
gated store buffer stores working memory state changes on
an "uncommitted" side of a hardware gate and official
memory state changes on a "committed" side of the hardware
gate where these committed stores "drain" to main
memory. A commit operation transfers stores from the
uncommitted side of the gate to the committed side of the
gate. The additional official registers and the gated store
buffer allow the state of memory and the state of the target
registers to be updated together once one or a group of target
instructions have been translated and run without error.

These updates are chosen by the code morphing software
to occur on integral target instruction boundaries. Thus, if
the primitive host instructions making up a translation of a
series of target instructions are run by the host processor
without generating exceptions, then the working memory
stores and working register state generated by those
instructions are transferred to official memory and to the official
target registers. In this manner, if an exception occurs when
processing the host instructions at a point which is not on the
boundary of one or a set of target instructions being
translated, the original state in the target registers at the last
update (or commit) may be recalled to the working registers
and uncommitted memory stores in the gated store buffer
may be dumped. Then, for the case where the exception
generated is a target exception, the target instructions causing
the target exception may be retranslated one at a time and
executed in serial sequence as they would be executed by a
target microprocessor. As each target instruction is correctly
executed without error, the state of the target registers may
be updated; and the data in the store buffer gated to memory.
Then, when the exception occurs again in running the host
instructions, the correct state of the target computer is held
by the target registers of the morph host and memory; and
the operation may be correctly handled without delay. Each
new translation generated by this corrective translating may
be cached for future use as it is translated or alternatively
dumped for a one time or rare occurrence such as a page
fault. This allows the microprocessor created by the
combination of the code morphing software and the morph host
to execute the instructions more rapidly than processors for
which the software was originally written.

It should be noted that in executing target programs using
the microprocessor of the present invention, many different
types of exceptions can occur which are handled in different
manners. For example, some exceptions are caused by the
target software generating an exception which utilizes a
target operating system exception handler. The use of such
an exception handler requires that the code morphing software
include routines for emulating the entire exception
handling process including any hardware provided by the
target computer for handling the process. This requires that
the code morphing software provide for saving the state of
the target processor so that it may proceed correctly after the
exception has been handled. Some exceptions like a page
fault, which requires fetching data in a new page of memory
before the process being translated may be implemented,
require a return to the beginning of the process being
translated after the exception has been handled. Other
exceptions implement a particular operation in software where
that operation is not provided by the hardware. These require
that the exception handler return the operation to the next
step in the translation after the exception has been handled.
Each of these different types of exceptions may be efficiently
handled by the present invention.

Additionally, some exceptions are generated by host hardware
and detect a variety of host and target conditions. Some
exceptions behave like exceptions on a conventional
microprocessor, but others are used by the code morphing
software to detect failure of various speculations. In these
cases, the code morphing software, using the state saving
and restoring mechanisms described above, causes the target
state to be restored to its most recent official version and
generates and saves a new translation (or re-uses a previously
generated safe translation) which avoids the failed
speculation. This translation is then executed.

The morph host includes additional hardware exception
detection mechanisms that in conjunction with the rollback
and retranslate method described above allow further
optimization. Examples are a means to distinguish memory from
memory mapped I/O and a means to eliminate memory
references by protecting addresses or address ranges thus
allowing target variables to be kept in registers.

For the case where exceptions are used to detect failure of
other speculations, such as whether an operation affects
memory or memory mapped I/O, recovery is accomplished
by the generation of new translations with different memory
operations and different optimizations.

FIG. 2 is a diagram of morph host hardware 12 designed
in accordance with the present invention represented running
the same application program which is being run on the
CISC processor of FIG. 1(a). As may be seen, the
microprocessor 10 includes the code morphing software 11
portion and the enhanced hardware morph host portion 12
described above. The target application furnishes the target
instructions (together referred to as 13) to the code morphing
software 11 for translation into host instructions which the
morph host 12 is capable of executing. In the meantime, the
target operating system receives calls from the target
application program and transfers these to the code morphing
software 11. In a preferred embodiment of the invention, the
morph host 12 is a very long instruction word (VLIW)
processor which is designed with a plurality of processing
channels. The overall operation of such a processor is further
illustrated in FIG. 6(c).

In FIGS. 6(a)-(c) are illustrated instructions adapted for
use with each of a CISC processor, a RISC processor, and a
VLIW processor. As may be seen, the CISC instructions are
of varied lengths and may include a plurality of more
primitive operations (e.g., load and add). The RISC
instructions, on the other hand, are of equal length and are
essentially primitive operations. The single very long
instruction for the VLIW processor illustrated includes each
of the more primitive operations (i.e., load, store, integer
add, compare, floating point multiply, and branch) of the
CISC and RISC instructions. As may be seen in FIG. 6(c),
each of the primitive instructions which together make up a
single very long instruction word is furnished in parallel
with the other primitive instructions either to one of a
plurality of separate processing channels of the VLIW
processor or to memory to be dealt with in parallel by the
processing channels and memory. The results of all of these
parallel operations are transferred into a multiported register
file.

A VLIW processor which may be the basis of the morph
host 12 is a much simpler processor than the other proces-
sors described above. It does not include circuitry to detect
issue dependencies or to reorder, optimize, and reschedule
primitive instructions. This, in turn, allows faster processing
at higher clock rates than is possible with either the
processors for which the target application programs were
originally designed or other processors using emulation programs
to run target application programs. However, the invention
is not limited to VLIW processors and may function as well
with any type of processor such as a RISC processor.

The code morphing software 11 of the microprocessor 10
shown in FIG. 2 includes a translator portion which decodes
the instructions of the target application, converts those
target instructions to the primitive host instructions capable
of execution by the morph host 12, optimizes the operations
required by the target instructions, reorders and schedules
the primitive instructions into VLIW instructions (a
translation) for the morph host 12, and executes the host
VLIW instructions. The operations of the translator are
illustrated in FIG. 7 which illustrates the operation of the
main loop of the code morphing software 11.

In order to accelerate the operation of the microprocessor
10 which includes the code morphing software 11 and the
enhanced morph host hardware 12, the code morphing
software includes a translation buffer 14 as is illustrated in
FIG. 2. The translation buffer 14 of one embodiment is a
software data structure which may be stored in memory; a
hardware cache might also be utilized in a particular
embodiment. The translation buffer is used to store the host
instructions which embody each completed translation of the
target instructions. As may be seen, once the individual
target instructions have been translated and the resulting
host instructions have been optimized, reordered, and
rescheduled, the resulting host translation is stored in the
translation buffer. The host instructions which make up the
translation are then executed by the morph host 12. If the
host instructions are executed without generating an
exception, the translation may thereafter be recalled whenever
the operations required by the target instruction or
instructions are required.

Thus, as shown in FIG. 7, a typical operation of the code
morphing software 11 of the microprocessor 10 when
furnished the address of a target instruction by the application
program is to first determine whether the target instruction
at the target address has been translated. If the target
instruction has not been translated, it and subsequent target
instructions are fetched, decoded, translated, and then
(possibly) optimized, reordered, and rescheduled into a new
host translation, and stored in the translation buffer 14 by the
translator. As will be seen later, there are various degrees of
optimization which are possible. The term "optimization" is
often used generically in this specification to refer to those
techniques by which processing is accelerated. For example,
reordering is one form of optimization which allows faster
processing and which is included within the term. Many of
the optimizations which are possible have been described
within the prior art of compiler optimizations, and some
optimizations, which were difficult to perform within the
prior art like "super-blocks" come from VLIW research.
Control is then transferred to the translation to cause
execution by the enhanced morph host hardware 12 to resume.

When the particular target instruction sequence is next
encountered in running the application, the host translation
will then be found in the translation buffer 14 and
immediately executed without the necessity of translating,
optimizing, reordering, or rescheduling. Using the advanced
techniques described below, it has been estimated that the
translation for a target instruction (once completely
translated) will be found in the translation buffer 14 all but
once for each one million or so executions of the translation.
Consequently, after a first translation, all of the steps
required for translation such as decoding, fetching primitive
instructions, optimizing the primitive instructions,
rescheduling into a host translation, and storing in the translation
buffer 14 may be eliminated from the processing required.
Since the processor for which the target instructions were
written must decode, fetch, reorder, and reschedule each
instruction each time the instruction is executed, this
drastically reduces the work required for executing the target
instructions and increases the speed of the microprocessor of
the present invention.

In eliminating all of these steps required in execution of
a target application by prior art processors, the
microprocessor 10 of the present invention overcomes problems of
the prior art which made the operations of the present
invention impossible at any reasonable speed. For example,
some of the techniques of the present invention were used in
the emulators described above used for porting applications
to other systems. However, some of these emulators had no
way of running more than short portions of applications
because in processing translated instructions, exceptions
which generate calls to various system exception handlers
were generated at points in the operation at which the state
of the host processor had no relation to the state of a target
processor processing the same instructions. Because of this,
the state of the target processor at the point at which such an
exception was generated was not known. Thus, correct state
of the target machine could not be determined; and the
operation would have to be stopped, restarted, and the
correct state ascertained before the exception could be
serviced and execution continued. This made running an
application program at host speed impossible.

The morph host hardware 12 of the present invention
includes a number of enhancements which overcome this
problem. These enhancements are each illustrated in FIGS.
3, 4, and 5. In order to determine the correct state of the
registers at the time an error occurs, a set of official target
registers 42 is provided by the enhanced hardware to hold
the state of the registers of the target processor for which the
original application was designed. These target registers 42
may be included in each of the floating point units 34, any
integer units 32, and any other execution units. These official
registers have been added to the morph host 12 of the present
invention along with an increased number of normal working
registers 41 so that a number of optimizations including
register renaming may be practiced. One embodiment of the
enhanced hardware includes sixty-four working registers 41
in the integer unit and thirty-two working registers 41 in the
floating point unit. The embodiment also includes an
enhanced set of target registers 42 which include all of the
frequently changed registers of the target processor necessary
to provide the state of that processor; these include
condition control registers and other registers necessary for
control of the simulated system.

It should be noted that depending on the type of enhanced
processing hardware utilized by the morph host, a translated
instruction sequence may include primitive operations
which constitute a plurality of target instructions from the
original application. For example, a VLIW microprocessor
may be capable of running a plurality of either CISC or
RISC instructions at once as is illustrated in FIGS. 6(a)-(c).
Whatever the morph host type, the state of the target
registers 42 of the morph host hardware of the invention is
not changed except at an integral target instruction boundary;
and then all target registers are updated. Thus, if the
microprocessor of the present invention is executing a target
instruction or instructions which have been translated into a 
series of primitive instructions which may have been reor- 
dered and rescheduled into a host translation, when the 
processor begins executing the translated instruction
sequence, the official target registers 42 hold the values 
which would be held by the registers of the target processor 
for which the application was designed when the first target 
instruction was addressed. After the morph host 12 has 
begun executing the translated instructions, however, the 
working registers hold values determined by the primitive 
operations of the translated instructions executed to that 
point. Thus, while some of these working registers may hold
values which are identical to those in the official target 
registers, others of the working registers hold values which 
are meaningless to the target processor. This is especially
true in an embodiment which provides many more registers 
than does a particular target machine in order to allow 
advanced acceleration techniques. Once the translated host 
instructions begin, the values in the working registers are 
whatever those translated host instructions determine the
condition of those registers to be. If a set of translated host 
instructions is executed without generating an exception, 
then the new working register values determined at the end 
of the set of instructions are transferred together to the 
official target registers 42 (possibly including a target
instruction pointer register). In the present embodiment of 
the invention, this transfer occurs outside of the execution of 
the host instructions in an additional pipeline stage so it does 
not slow operation of the morph host 12. 

In a similar manner, a gated store buffer 50 such as that
illustrated in FIG. 5 is utilized in the present invention to 
control the transfer of data to memory. The gated store buffer 
50 includes a number of elements each of which may hold 
the address and data for a memory store operation. These 
elements may be implemented by any of a number of
different hardware arrangements (e.g., first-in first-out 
buffers); the embodiment illustrated is implemented utiliz- 
ing random access memory and three dedicated working 
registers. The three registers store, respectively, a pointer 51 
to the head of the queue of memory stores, a pointer 52 to
the gate, and a pointer 53 to the tail of the queue of the 
memory stores. Memory stores positioned between the head 
of the queue and the gate are already committed to memory, 
while those positioned between the gate of the queue and the 
tail are not yet committed to memory. Memory stores
generated during execution of host translations are placed in 
the store buffer 50 by the integer unit in the order generated 
during the execution of the host instructions by the morph 
host but are not allowed to be written to memory until a 
commit operation is encountered in a host instruction. Thus,
as translations execute, the store operations are placed in the 
queue. Assuming these are the first stores so that no other 
stores are in the gated store buffer 50, both the head and gate 
pointers will point to the same position. As each store is 
executed, it is placed in the next position in the queue and
the tail pointer is incremented to the next position (upward in
the figure). This continues until a commit command is 
executed. This will normally happen when the translation of 
a set of target instructions has been completed without 
generating an exception or an error exit condition. When a
translation has been executed by the morph host 12 without 
error, then the memory stores in the store buffer 50 generated 
during execution are moved together past the gate of the 
store buffer 50 (committed) and subsequently written to 
memory. In the embodiment illustrated, this is accomplished
by copying the value in the register holding the tail pointer 
to the register holding the gate pointer. 
Thus, it may be seen that both the transfer of register state
from working registers 41 to official target registers 42 and 
the transfer of working memory stores to official memory 
occur together and only on boundaries between integral 
target instructions in response to explicit commit operations. 

This allows the microprocessor 10 to recover from target 
exceptions which occur during execution by the enhanced 
morph host 12 without any significant delay. If a target 
exception is generated during the running of any translated 
instruction or instructions, that exception is detected by the 
morph host hardware or software. In response to the detec- 
tion of the target exception, the code morphing software 11 
may cause the values retained in the official registers to be 
placed back into the working registers 41 and any non- 
committed memory stores in the gated store buffer 50 to be 
dumped (an operation referred to as "rollback"). The 
memory stores in the gated store buffer 50 of FIG. 5 may be 
dumped by copying the value in the register holding the gate 
pointer to the register holding the tail pointer. 
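
A minimal C sketch of commit and rollback acting on register state and the store buffer pointers together may help here. It assumes a hypothetical HostState structure with eight target registers; the structure, its field names, and the functions commit and rollback are illustrative only and are not taken from the described hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_TARGET_REGS 8   /* hypothetical register count */

typedef struct {
    uint32_t working[NUM_TARGET_REGS];   /* working registers 41 */
    uint32_t official[NUM_TARGET_REGS];  /* official target registers 42 */
    unsigned gate;                       /* gate pointer of the store buffer */
    unsigned tail;                       /* tail pointer of the store buffer */
} HostState;

/* Commit: working state becomes official and pending stores pass the gate. */
static void commit(HostState *h)
{
    memcpy(h->official, h->working, sizeof h->official);
    h->gate = h->tail;
}

/* Rollback: restore the working registers from the official registers
   and dump uncommitted stores by copying the gate pointer to the tail. */
static void rollback(HostState *h)
{
    memcpy(h->working, h->official, sizeof h->working);
    h->tail = h->gate;
}
```

Because both operations are plain copies, rollback in this model costs no more than commit, which is consistent with recovery "without any significant delay".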

Placing the values from the target registers 42 into the 
working registers 41 may place the address of the first of the 
target instructions which were running when the exception 
occurred in the working instruction pointer register. Begin- 
ning with this official state of the target processor in the 
working registers, the target instructions which were run- 
ning when the exception occurred are retranslated in serial 
order without any reordering or other optimizing. After each 
target instruction is newly decoded and translated into a new 
host translation, the translated host instruction representing 
the target instructions is executed by the morph host 12 and 
causes or does not cause an exception to occur. (If the morph 
host 12 is other than a VLIW processor, then each of the 
primitive operations of the host translation is executed in 
sequence. If no exception occurs as the host translation is 
run, the next primitive function is run.) This continues until 
an exception re-occurs or the single target instruction has 
been translated and executed. In one embodiment, if a 
translation of a target instruction is executed without an 
exception being generated, then the state of working regis- 
ters 41 is transferred to the target registers 42 and any data 
in the gated store buffer 50 is committed so that it may be 
transferred to memory. However, if an exception re-occurs 
during the running of a translation, then the state of the target 
registers and memory has not changed but is identical to the 
state produced in a target computer when the exception 
occurs. Consequently, when the target exception is 
generated, the exception will be correctly handled by the 
target operating system. 

Similarly, once a first target instruction of the series of 
instructions the translation of which generated an exception 
has been executed without generating an exception, the 
target instruction pointer points to the next of the target 
instructions. This second target instruction is decoded and 
retranslated without optimizing or reordering in the same 
manner as the first. As each of the host translations of a 
single target instruction is processed by the morph host 12, 
any exception generated will occur when the state of the 
target registers and memory is identical to the state which 
would occur in the target computer. Consequently, the 
exception may be immediately and correctly handled. These 
new translations may be stored in the translation buffer 14 as 
the correct translations for that sequence of instructions in 
the target application and recalled whenever the instructions 
are rerun. 

Other embodiments of the invention for accomplishing 
the same result as the gated store buffer 50 of FIG. 5 might 
include arrangements for transferring stores directly to 
memory while recording data sufficient to recover state of
the target computer in case the execution of a translation 
results in an exception or an error necessitating rollback. In 
such a case, the effect of any memory stores which occurred 
during translation and execution would have to be reversed
and the memory state existing at the beginning of the 
translation restored; while working registers would have to 
receive data held in the official target registers in the manner 
discussed above. One embodiment for accomplishing this 
maintains a separate target memory to hold the original
memory state which is then utilized to replace overwritten 
memory if a rollback occurs. Another embodiment for 
accomplishing memory rollback logs each store and the 
memory data replaced as they occur, and then reverses the 
store process if rollback is required.
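
The logging embodiment mentioned last can be sketched as an undo log. This is a hedged illustration only: LoggedMemory, logged_store, logged_commit, logged_rollback, and the sizes MEM_WORDS and LOG_SLOTS are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64u   /* hypothetical toy memory size */
#define LOG_SLOTS 32u   /* hypothetical log capacity */

typedef struct { uint32_t addr, old; } UndoEntry;

typedef struct {
    uint32_t  mem[MEM_WORDS];
    UndoEntry log[LOG_SLOTS];
    unsigned  count;
} LoggedMemory;

/* Store directly to memory after recording the data being replaced.
   Returns 0 if the address is out of range or the log is full. */
static int logged_store(LoggedMemory *m, uint32_t addr, uint32_t data)
{
    if (m->count == LOG_SLOTS || addr >= MEM_WORDS)
        return 0;
    m->log[m->count++] = (UndoEntry){ addr, m->mem[addr] };
    m->mem[addr] = data;
    return 1;
}

/* Commit: the stores are already in memory, so only the log is discarded. */
static void logged_commit(LoggedMemory *m)
{
    m->count = 0;
}

/* Rollback: undo the stores newest first, so that repeated writes to one
   address end with the value that existed before the translation began. */
static void logged_rollback(LoggedMemory *m)
{
    while (m->count > 0) {
        UndoEntry e = m->log[--m->count];
        m->mem[e.addr] = e.old;
    }
}
```

Compared with the gated store buffer, this arrangement makes commit cheap and rollback proportional to the number of stores logged.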

The code morphing software of the present invention 
provides an additional operation which greatly enhances the 
speed of processing programs which are being translated. In 
addition to simply translating the instructions, optimizing, 
reordering, rescheduling, caching, and executing each
translation so that it may be rerun whenever that set of
instructions needs to be executed, the translator also links the
different translations to eliminate in almost all cases a return 
to the main loop of the translation process. FIG. 8 illustrates 
the steps carried out by the translator portion of the code
morphing software 11 in accomplishing this linking process. 
It will be understood by those skilled in the art that this 
linking operation essentially eliminates the return to the 
main loop for most translations of instructions, which
eliminates this overhead.

Presume for exemplary purposes that the target program 
being run consists of X86 instructions. When a translation of 
a sequence of target instructions occurs and the primitive 
host instructions are reordered and rescheduled, two
primitive instructions may occur at the end of each host
translation. The first is a primitive instruction which updates the
value of the instruction pointer for the target processor (or its 
equivalent); this instruction is used to place the correct 
address of the next target instruction in the target instruction 
pointer register. Following this primitive instruction is a
branch instruction which contains the address of each of two 
possible targets for the branch. The manner in which the 
primitive instruction which precedes the branch instruction 
may update the value of the instruction pointer for the target 
processor is to test the condition code for the branch in the
condition code registers and then determine whether one of 
the two branch addresses indicated by the condition con- 
trolling the branch is stored in the translation buffer 14. The 
first time the sequence of target instructions is translated, the 
two branch targets of the host instruction both hold the same
host processor address for the main loop of the translator 
software. When the host translation is completed, stored in 
the translation buffer 14, and executed for the first time, the 
instruction pointer is updated in the target instruction pointer 
register (as are the rest of the target registers); and the
operation branches back to the main loop. At the main loop, 
the translator software looks up the instruction pointer to the 
next target instruction in the target instruction pointer reg- 
ister. Then the next target instruction sequence is addressed. 
Presuming that this sequence of target instructions has not
yet been translated and therefore a translation does not 
reside in the translation buffer 14, the next set of target 
instructions is fetched from memory, decoded, translated, 
optimized, reordered, rescheduled, cached in the translation 
buffer, and executed. Since the second set of target
instructions follows the first set of target instructions, the primitive
branch instruction at the end of the host translation of the
first set of target instructions is automatically updated to
substitute the address of the host translation of the second set 
of target instructions as the branch address for the particular 
condition controlling the branch. 

If, then, the second translated host instruction were to loop
back to the first translated host instruction, the branch 
operation at the end of the second translation would include 
the main loop address and the X86 address of the first 
translation as the two possible targets for the branch. The 
update-instruction-pointer primitive operation preceding the 
branch tests the condition and determines that the loop back 
to the first translation is to be taken and updates the target 
instruction pointer to the X86 address of the first translation. 
This causes the translator to look in the translation buffer 14 
to see if the X86 address being sought appears there. The 
address of the first translation is found, and its value in host 
memory space is substituted for the X86 address in the 
branch at the end of the second host translated instruction. 
Then, the second host translated instruction is cached and 
executed. This causes the loop to be run until the condition 
causing the branch from the first translation to the second 
translation fails, and the branch takes the path back to the 
main loop. When this happens, the first translated host 
instruction branches back to the main loop where the next set 
of target instructions designated by the target instruction 
pointer is searched for in the translation buffer 14, the host 
translation is fetched from the cache; or the search in the 
translation buffer fails, and the target instructions are fetched 
from memory and translated. When this translated host 
instruction is cached in the translation buffer 14, its address 
replaces the main loop address in the branch instruction 
which ended the loop. 

In this manner, the various translated host instructions are 
chained to one another so that the need to follow the long 
path through the translator main loop only occurs where a 
link does not exist. Eventually, the main loop references in 
the branch instructions of host instructions are almost com- 
pletely eliminated. When this condition is reached, the time 
required to fetch target instructions, decode target 
instructions, fetch the primitive instructions which make up 
the target instructions, optimize those primitive operations, 
reorder the primitive operations, and reschedule those primi- 
tive operations before running any host instruction is elimi- 
nated. Thus, in contrast to all prior art microprocessors 
which must take each of these steps each time any applica- 
tion instruction sequence is run, the work required to run any 
set of target instructions using the present invention after the 
first translation has taken place is drastically reduced. This 
work is further reduced as each set of translated host 
instructions is linked to the other sets of translated host 
instructions. In fact, it is estimated that translation will be 
needed in less than one translation execution out of one 
million during the running of an application. 
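
The chaining of translations can be sketched in C as exit links that start out pointing at the translator main loop and are patched once the destination translation is found in the translation buffer. All names here (Translation, TranslationBuffer, tb_add, tb_lookup, follow_exit, MAIN_LOOP) are hypothetical, and the linear lookup merely stands in for whatever search the translation buffer actually uses.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRANSLATIONS 8
#define MAIN_LOOP (-1)   /* exit link still points at the translator main loop */

typedef struct {
    uint32_t target_addr;     /* target address the translation was made from */
    uint32_t exit_target[2];  /* target address for each branch outcome */
    int      exit_link[2];    /* MAIN_LOOP or index of the linked translation */
} Translation;

typedef struct {
    Translation t[MAX_TRANSLATIONS];
    int count;
    int main_loop_trips;      /* how often the long path was taken */
} TranslationBuffer;

/* Cache a new translation; both exits initially return to the main loop. */
static int tb_add(TranslationBuffer *tb, uint32_t addr,
                  uint32_t taken, uint32_t not_taken)
{
    if (tb->count == MAX_TRANSLATIONS)
        return -1;
    Translation *t = &tb->t[tb->count];
    t->target_addr = addr;
    t->exit_target[0] = taken;
    t->exit_target[1] = not_taken;
    t->exit_link[0] = t->exit_link[1] = MAIN_LOOP;
    return tb->count++;
}

/* Find a translation by target address; returns -1 if none is cached. */
static int tb_lookup(const TranslationBuffer *tb, uint32_t addr)
{
    for (int i = 0; i < tb->count; i++)
        if (tb->t[i].target_addr == addr)
            return i;
    return -1;
}

/* Leave translation `from` through exit `which`. A patched link is followed
   directly; otherwise the main loop looks up the destination and, if it is
   cached, patches the link. Returns -1 when a new translation is needed. */
static int follow_exit(TranslationBuffer *tb, int from, int which)
{
    Translation *t = &tb->t[from];
    if (t->exit_link[which] != MAIN_LOOP)
        return t->exit_link[which];
    tb->main_loop_trips++;
    int next = tb_lookup(tb, t->exit_target[which]);
    if (next >= 0)
        t->exit_link[which] = next;
    return next;
}
```

After the first pass through a loop of two translations, both exits are linked and main_loop_trips stops growing, which is the effect the linking process is meant to achieve.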

Those skilled in the art will recognize that the implemen- 
tation of the present invention requires a large translation 
buffer since each set of instructions which is translated is 
cached in order that it need not be translated again. Trans- 
lators designed to function with applications programmed 
for different systems will vary in their need for supporting 
buffer memory. However, one embodiment of the invention 
designed to run X86 programs utilizes a translation buffer of 
two megabytes of random access memory. 

Two additional hardware enhancements help to increase 
the speed at which applications can be processed by the 
microprocessor 10 of the present invention. The first of these 
is an abnormal/normal (A/N) protection bit stored with each 
address translation in a translation look-aside buffer 
31 (TLB) (see FIG. 3) where lookup of the physical address
of target instructions is first accomplished. Target memory 
operations within translations can be of two types, ones 
which operate on memory (normal) or ones which operate 
on a memory mapped I/O device (abnormal). 

A normal access which affects memory completes nor- 
mally. When instructions operate on memory, the optimizing 
and reordering of those instructions is appropriate and 
greatly aids in speeding the operation of any system using 
the microprocessor 10 of the present invention. On the other 
hand, the operations of an abnormal access which affects an 
I/O device often must be practiced in the precise order in 
which those operations are programmed without the
elimination of any steps or they may have some adverse effect at
the I/O device. For example, a particular I/O operation may 
have the effect of clearing an I/O register; if the primitive 
operations take place out of order, then the result of the 
operations may be different than the operation commanded 
by the target instruction. Without a means to distinguish 
memory from memory mapped I/O, it is necessary to treat 
all memory with the conservative assumptions used to 
translate instructions which affect memory mapped I/O. This
severely restricts the nature of optimizations that are achiev- 
able. Because prior art emulators lacked means to both 
detect a failure of speculation on the nature of the memory 
being addressed, and means to recover from such failures, 
their performance was restricted. 

In one embodiment of the invention, the A/N bit is 
initially set in the translation look-aside buffer to indicate a 
memory page. A translation of an operation which affects
memory as though it were a memory operation is actually a 
speculation that the operation is one affecting memory. After 
the translation has been accomplished and executes, the 
target memory reference is checked by comparing the access 
type (normal or abnormal) against the TLB A/N protection
bit. When the access type does not match the A/N protection, 
an exception occurs. If the operation in fact affects memory, 
then the optimizing, reordering, and rescheduling techniques 
described above were correctly applied. If the comparison of 
the A/N bit in the TLB 31 shows that the operation, however, 
affects an I/O device, then execution causes an exception to 
be taken; and the translator produces a new translation one 
target instruction at a time without optimizing, reordering, or 
rescheduling of any sort. Similarly, if a translation incor- 
rectly assumes an I/O operation for an operation which 
actually affects memory, execution causes an exception to be 
taken; and the target instructions are retranslated using the 
optimizing, reordering, and rescheduling techniques. In this 
manner, the present invention can enhance performance 
beyond what has been traditionally possible. 
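
The A/N check and the choice of retranslation it drives can be sketched as follows. This is an assumption-laden model rather than the hardware comparison itself: TlbEntry, AccessType, TranslationKind, an_mismatch, and next_translation are hypothetical names, and an optimized translation is taken to assume a normal access while a serial one assumes an abnormal access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACCESS_NORMAL, ACCESS_ABNORMAL } AccessType;
typedef enum { XLATE_OPTIMIZED, XLATE_SERIAL } TranslationKind;

typedef struct {
    uint32_t virt_page;
    uint32_t phys_page;
    bool     abnormal;   /* A/N bit: true for a memory mapped I/O page */
} TlbEntry;

/* True when the access type assumed by the translation disagrees with the
   A/N bit of the page, i.e. when an exception would be taken. */
static bool an_mismatch(const TlbEntry *e, AccessType assumed)
{
    return e->abnormal != (assumed == ACCESS_ABNORMAL);
}

/* Decide which kind of translation should be used after an execution:
   keep the current one if the speculation held, otherwise switch. */
static TranslationKind next_translation(const TlbEntry *e,
                                        TranslationKind current)
{
    AccessType assumed = (current == XLATE_OPTIMIZED) ? ACCESS_NORMAL
                                                      : ACCESS_ABNORMAL;
    if (!an_mismatch(e, assumed))
        return current;
    return e->abnormal ? XLATE_SERIAL : XLATE_OPTIMIZED;
}
```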

One of the most frequent speculations practiced by the 
present invention is that target exceptions will not occur 
within a translation. This allows significant optimization 
over the prior art. First, target state does not have to be 
updated on each target instruction boundary, but only on 
target instruction boundaries which occur on translation 
boundaries. This eliminates instructions necessary to save 
target state on each target instruction boundary. Optimiza- 
tions that would previously have been impossible in sched- 
uling and removing redundant operations are also made 
possible. 

The present invention is admirably adapted to select the 
appropriate process of translation. In accordance with the 
method of translating described above, a set of instructions 
may first be translated as though it were to affect memory. 
When the optimized, reordered, and rescheduled host 
instructions are then executed, the address may be found to 
refer to an I/O device by the condition of the A/N bit
provided in the translation look-aside buffer. The comparison
of the A/N bit and the translated instruction address
which shows that an operation is an I/O operation generates
an error exception which causes a software initiated rollback
procedure to occur, causing any uncommitted memory
stores to be dumped and the values in the target registers to
be placed back into the working registers. Then the
translation starts over, one target instruction at a time without
optimization, reordering, or rescheduling. This re-translation
is the appropriate host translation for an I/O device.

In a similar manner, it is possible for a memory operation
to be incorrectly translated as an I/O operation. The error
generated may be used to cause its correct re-translation
where it may be optimized, reordered, and rescheduled to
provide faster operation.

Prior art emulators have also struggled with what is
generally referred to as self modifying code. Should a target
program write to the memory that contains target
instructions, this will cause translations that exist for these
target instructions to become "stale" and no longer valid. It
is necessary to detect these stores as they occur dynamically.
In the prior art, such detection has to be accomplished with
extra instructions for each store. This problem is larger in
scope than programs modifying themselves. Any agent
which can write to memory, such as a second processor or
a DMA device, can also cause this problem.

The present invention deals with this problem by another
enhancement to the morph host 12. A translation bit (T bit)
which may also be stored in the translation look-aside buffer
is used to indicate target memory pages for which
translations exist. The T bit thus possibly indicates that particular
pages of target memory contain target instructions for which
host translations exist which would become stale if those
target instructions were to be overwritten. If an attempt is
made to write to the protected pages in memory, the
presence of the translation bit will cause an exception which
when handled by the code morphing software can cause the
appropriate translation(s) to be invalidated or removed from
the translation buffer. The T bit can also be used to mark
other target pages that translation may rely upon not being
written.
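
The T bit mechanism can be modeled in C as a per-page flag checked on every target store. The sketch below rests on several assumptions that are not in the text: 4 KB target pages, a small fixed table of translations, and a handler that invalidates every translation made from the written page and then clears the page's T bit. The names Host, add_translation, and on_target_write are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT       12   /* hypothetical 4 KB target pages */
#define NUM_PAGES        16
#define MAX_TRANSLATIONS 8

typedef struct { uint32_t target_addr; bool valid; } Translation;

typedef struct {
    bool        t_bit[NUM_PAGES];      /* T bit per target page */
    Translation xl[MAX_TRANSLATIONS];
    int         count;
} Host;

/* Record a translation and set the T bit of the page it was made from. */
static int add_translation(Host *h, uint32_t target_addr)
{
    uint32_t page = target_addr >> PAGE_SHIFT;
    if (h->count == MAX_TRANSLATIONS || page >= NUM_PAGES)
        return -1;
    h->xl[h->count] = (Translation){ target_addr, true };
    h->t_bit[page] = true;
    return h->count++;
}

/* Model a target store. If the page's T bit is set, the "exception handler"
   invalidates the translations made from that page and clears the bit.
   Returns the number of translations invalidated (0 means no exception). */
static int on_target_write(Host *h, uint32_t addr)
{
    uint32_t page = addr >> PAGE_SHIFT;
    if (page >= NUM_PAGES || !h->t_bit[page])
        return 0;
    int dropped = 0;
    for (int i = 0; i < h->count; i++) {
        if (h->xl[i].valid && (h->xl[i].target_addr >> PAGE_SHIFT) == page) {
            h->xl[i].valid = false;
            dropped++;
        }
    }
    h->t_bit[page] = false;
    return dropped;
}
```

Stores to pages without translations take the fast path in this model, which reflects the point that no extra instructions are needed for each store.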

This may be understood by referring to FIG. 3 which
illustrates in block diagram form the general functional
elements of the microprocessor 10 of the invention. When
the morph host 12 executes a target program, it actually runs
the translator portion of the code morphing software 11
which includes the only original untranslated host
instructions which effectively run on the morph host 12. To the
right in the figure is illustrated memory divided into a host
portion including essentially the translator and the
translation buffer 14 and a target portion including the target
instructions and data, including the target operating system.
The morph host hardware begins executing the translator by
fetching host instructions from memory and placing those
instructions in an instruction cache. The translator
instructions generate a fetch of the first target instructions stored in
the target portion of memory. Carrying out a target fetch
causes the integer unit to look to the official target
instruction pointer register for a first address of a target instruction.
The first address is then accessed in the translation
look-aside buffer 31 of the memory management unit 33. The
memory management unit 33 includes hardware for paging
and provides memory mapping facilities for the TLB 31.
Presuming that the TLB 31 is correctly mapped so that it
holds lookup data for the correct page of target memory, the
target instruction pointer value is translated to the physical
address of the target instruction. At this point, the condition
of the bit (T bit) indicating whether a translation has been
accomplished for the target instruction is detected, but the
access is a read operation and no T bit exception will occur.
The condition of the A/N bit indicating whether the access
is to memory or memory mapped I/O is also detected.
Presuming the last mentioned bit indicates a memory
location, the target instruction is accessed in target memory
since no translation exists. The target instruction and
subsequent target instructions are transferred as data to the
morph host computing units and translated under control of
the translator instructions stored in the instruction cache.
The translator instructions utilize reordering, optimizing,
and rescheduling techniques as though the target instruction
affected memory. The resulting translation containing a
sequence of host instructions is then stored in the translation
buffer 14 in host memory. The translation is transferred
directly to the translation buffer 14 in host memory via the
gated store buffer 50. Once the translation has been stored in
host memory, the translator branches to the translation which
then executes. The execution (and subsequent executions)
will determine if the translation has made correct
assumptions concerning exceptions and memory. Prior to executing
the translation, the T bit for the target page(s) containing the
target instructions that have been translated is set. This
indication warns that the instruction has been translated;
and, if an attempt to write to the target address occurs, the
attempt generates an exception which causes the translation
to possibly be invalidated or removed.

An additional hardware enhancement to the morph host is
a circuit utilized to allow data which is normally stored in
memory but is used quite often in the execution of an
operation to be replicated (or "aliased") in an execution unit
register in order to eliminate the time required to fetch the
data from memory on each use. To accomplish this in one
embodiment, the morph host is designed to respond to a
"load and protect" command which copies the memory data
to a working register 111 in an execution unit 110 shown in
FIG. 11 and places the memory address in a register 112 in
that unit. Associated with the address register is a
comparator 113. The comparator receives the addresses of loads and
stores to the gated store buffer 50 directed to memory during
translations. If a memory address for either a load or a store
compares with an address in the register 112 (or additional
registers depending on the implementation), an exception is
generated. The code morphing software responds to the
exception by assuring that the memory address and the
register hold the same correct data. In one embodiment, this
is accomplished by rolling back the translation and
reexecuting it without any "aliased" data in an execution register.
Other possible methods of correcting the problem are to
update the register with the latest memory data or memory
with the latest load data.

It will be recognized by those skilled in the art that the
host processor of the present invention may be connected in
circuit with typical computer elements to form a computer
such as that illustrated in FIG. 9. As may be seen, when used
in a modern X86 computer the host processor is joined by a
processor bus to memory and bus control circuitry. The
memory and bus control circuitry is arranged to provide
access to main memory as well as to cache memory which
may be utilized with the microprocessor. The memory and
bus control circuitry also provides access to a bus such as a
PCI or other local bus through which I/O devices may be
accessed. The particular computer system will depend upon
the circuitry utilized with a typical microprocessor which the
microprocessor of the present invention replaces.

In order to illustrate the operation of the processor of the
present invention and the manner in which acceleration of
execution occurs, the translation of a small sample of X86
target code to host primitive instructions is presented at this
point. The sample illustrates the translation of X86 target
instructions to morph host instructions including various
exemplary steps of optimizing, reordering, and rescheduling
by the microprocessor of the invention. By following the
process illustrated, the substantial difference between the
operations required to execute the original instructions using
the target processor and the operations required to execute
the translation on the host processor will become apparent to
those skilled in the art.

The original instruction illustrated in C language source
code describes a very brief loop operation. Essentially, while
some variable "n" which is being decremented after each
loop remains greater than "0", a value "c" is stored at an
address indicated by a pointer "*s" which is being
incremented after each loop.

Original C code

    while( (n--) > 0 ) {
        *s++ = c;
    }

Win32 x86 instructions produced by a compiler compiling this C code.

    mov  %ecx,[%ebp+0xc]   // load c from memory address into %ecx
    mov  %eax,[%ebp+0x8]   // load s from memory address into %eax
    mov  [%eax],%ecx       // store c into memory address s held in %eax
    add  %eax,#4           // increment s by 4
    mov  [%ebp+0x8],%eax   // store (s + 4) back into memory
    mov  %eax,[%ebp+0x10]  // load n from memory address into %eax
    lea  %ecx,[%eax-1]     // decrement n and store the result in %ecx
    mov  [%ebp+0x10],%ecx  // store (n-1) into memory
    and  %eax,%eax         // test n to set condition codes
    jg   .-0x1b            // branch to top of this section if "n>0"

Notation: [...] indicates an address expression for a memory operand. In
the example above, the address for a memory operand is formed from the
contents of a register added to a hexadecimal constant indicated by the 0x
prefix. Target registers are indicated with the % prefix, e.g. %ecx is the ecx
register. The destination of an operation is to the left.

    jg  = jump if greater
    mov = move
    lea = load effective address
    and = AND

In this first portion of the sample, each of the individual
X86 assembly language instructions for carrying out the
execution of the operation defined by the C language
statement is listed by the assembly language mnemonic for the
operation followed by the parameters involved in the
particular primitive operation. An explanation of the operation
is also provided in a comment for each instruction. Even
though the order of execution may be varied by the target
processor from that shown, each of these assembly language
instructions must be executed each time the loop is executed
in carrying out the target C language instructions. Thus, if
the loop is executed one hundred times, each instruction
shown above must be carried out one hundred times.

The following shows each X86 instruction shown above followed by the
host instructions necessary to implement the X86 instruction.

    mov  %ecx,[%ebp+0xc]   // load c from memory address into %ecx
    add  R0,Rebp,0xc       ; form the memory address and put it in R0
    ld   Recx,[R0]         ; load c from memory address in R0 into Recx
-continued

    mov  %eax,[%ebp+0x8]   // load s from memory address into %eax
    add  R2,Rebp,0x8       ; form memory address and put it in R2
    ld   Reax,[R2]         ; load s from memory address in R2 into Reax
    mov  [%eax],%ecx       // store c into memory address s held in %eax
    st   [Reax],Recx       ; store c into memory address s held in Reax
    add  %eax,#4           // increment s by 4
    add  Reax,Reax,4       ; increment s by 4
    mov  [%ebp+0x8],%eax   // store (s + 4) back into memory
    add  R5,Rebp,0x8       ; form memory address and put it in R5
    st   [R5],Reax         ; store (s + 4) back into memory
    mov  %eax,[%ebp+0x10]  // load n from memory address into %eax
    add  R7,Rebp,0x10      ; form memory address and put it in R7
    ld   Reax,[R7]         ; load n from memory address into Reax
    lea  %ecx,[%eax-1]     // decrement n and store result in %ecx
    sub  Recx,Reax,1       ; decrement n and store result in Recx
    mov  [%ebp+0x10],%ecx  // store (n - 1) into memory
    add  R9,Rebp,0x10      ; form memory address and put it in R9
    st   [R9],Recx         ; store (n - 1) into memory
    and  %eax,%eax         // test n to set condition codes
    andcc R11,Reax,Reax    ; test n to set condition codes
    jg

-continued

    st   [R4],Recx         ; store c into memory address s
    add  %eax,#4           // increment s by 4
    addcc Reax,Reax,4      ; increment s by 4
    mov  [%ebp+0x8],%eax   // store (s + 4) to memory
    add  R5,Rebp,0x8       ; form logical address into R5
    chkl R5,Rss_limit      ; Check logical address against segment lower limit
    chku R5,R_FFFFFFFF     ; Check logical address against segment upper limit
    add  R6,R5,Rss_base    ; add segment base to form linear address
    st   [R6],Reax         ; store (s + 4) to memory address in R6
    mov  %eax,[%ebp+0x10]  // load n
    add  R7,Rebp,0x10      ; form logical address into R7
    chkl R7,Rss_limit      ; Check logical address against segment lower limit
    chku R7,R_FFFFFFFF     ; Check logical address against segment upper limit
    add  R8,R7,Rss_base    ; add segment base to form linear address
    ld   Reax,[R8]         ; load n from memory address in R8 into Reax


.-Ox lb 


// branch to top of this section if "n>0" 




lea 


%ecx,t%eax-l] 


// degrement n 


jg 


mainloop,mainloop 


; jump to main loop 




sub 


Recx,Reax,l 


; decrement n 










mov 


[%ebr>+Oxl0] ) %ecx 


// store (n - 1) 










Host Instruction key: 






add 


R9,Rebp,0xl0 


; form logical address into R9 


Id - load 




25 


chkl 


R9,Rss_limit 


; Check the logical address against 


add - 


ADD 








segment lower limit 



st - store 
sub - subtract 

jg o jump if condition codes indicate greater 
andec = and set the condition codes 



The next sample illustrates the same target primitive 
instructions which carry out the C language instructions. 
However, following each primitive target instruction are 
listed primitive host instructions required to accomplish the 
same operation in one particular embodiment of the micro- 
processor of the invention in which the morph host is a 
VLIW processor designed in the manner described herein. It 
should be noted that the host registers which are shadowed 
by official target registers are designated by an "R" followed 
by the X86 register designation so that, for example, Reax 
is the working register associated with the EAX official 
target register. 



Adds host instructions necessary to perform X86 address computation and
upper and lower segment limit checks.

mov    %ecx,[%ebp+0xc]           // load c
add    R0,Rebp,0xc               ; form logical address into R0
chkl   R0,Rss_limit              ; Check logical address against segment lower limit
chku   R0,R_FFFFFFFF             ; Check logical address against segment upper limit
add    R1,R0,Rss_base            ; add segment base to form linear address
ld     Recx,[R1]                 ; load c from memory address in R1 into Recx
mov    %eax,[%ebp+0x8]           // load s
add    R2,Rebp,0x8               ; form logical address into R2
chkl   R2,Rss_limit              ; Check logical address against segment lower limit
chku   R2,R_FFFFFFFF             ; Check logical address against segment upper limit
add    R3,R2,Rss_base            ; add segment base to form linear address
ld     Reax,[R3]                 ; load s from memory address in R3 into Reax
mov    [%eax],%ecx               // store c into [s]
chku   Reax,Rds_limit            ; Check logical address against segment upper limit
add    R4,Reax,Rds_base          ; add segment base to form linear address

-continued

chku   R9,R_FFFFFFFF             ; Check logical address against segment upper limit
add    R10,R9,Rss_base           ; add segment base to form the linear address
st     [R10],Recx                ; store n-1 in Recx into memory using address in R10
and    %eax,%eax                 // test n to set condition codes
andcc  R11,Reax,Reax             ; test n to set condition codes
jg     .-0x1b                    // branch to top of this section if "n>0"
jg     mainloop,mainloop         ; jump to main loop

Host Instruction key:
chkl = check lower limit
chku = check upper limit

The next sample illustrates for each of the primitive target
instructions the addition of host primitive instructions by
which addresses needed for the target operation may be
generated by the code morphing software. It should be noted
that host address generation instructions are only required in
an embodiment of a microprocessor in which code morphing
software is used for address generation rather than address
generation hardware. In a target processor such as an X86
microprocessor these addresses are generated using address
generation hardware. Whenever address generation
occurs in such an embodiment of the invention, the calculation
is accomplished; and host primitive instructions are
also added to check the address values to determine that the
calculated addresses are within the appropriate X86 segment
limits.
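
The address generation and limit checks described above can be sketched in software. The following Python fragment is only an illustration under stated assumptions: the names `Segment`, `SegmentFault`, and `linear_address` are hypothetical and do not come from this specification, and the lower and upper limits are modeled as plain integer bounds. It follows the order of the host sequence: form the logical address, check it against the lower limit (chkl) and the upper limit (chku), then add the segment base.

```python
from dataclasses import dataclass


class SegmentFault(Exception):
    """Raised when a logical address falls outside the segment limits."""


@dataclass(frozen=True)
class Segment:
    base: int   # linear address at which the segment starts
    lower: int  # lowest legal logical address (the chkl operand)
    upper: int  # highest legal logical address (the chku operand)


def linear_address(seg: Segment, reg: int, offset: int) -> int:
    """Mimic: add / chkl / chku / add-segment-base for one memory operand."""
    logical = (reg + offset) & 0xFFFFFFFF      # add  R0,Rebp,0xc
    if logical < seg.lower:                    # chkl R0,Rss_limit
        raise SegmentFault(f"{logical:#x} is below the lower limit")
    if logical > seg.upper:                    # chku R0,R_FFFFFFFF
        raise SegmentFault(f"{logical:#x} is above the upper limit")
    return (seg.base + logical) & 0xFFFFFFFF   # add  R1,R0,Rss_base
```

A failed check raises an exception here, which stands in for the host exception that would return control to the code morphing software.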




Adds instructions to maintain the target X86 instruction pointer "eip" and
the commit instructions that use the special morph host hardware to
update X86 state.

mov    %ecx,[%ebp+0xc]           // load c
add    R0,Rebp,0xc
chkl   R0,Rss_limit
chku   R0,R_FFFFFFFF
add    R1,R0,Rss_base
ld     Recx,[R1]
add    Reip,Reip,3               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
mov    %eax,[%ebp+0x8]           // load s
add    R2,Rebp,0x8
chkl   R2,Rss_limit
chku   R2,R_FFFFFFFF
add    R3,R2,Rss_base
ld     Reax,[R3]
add    Reip,Reip,3               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
mov    [%eax],%ecx               // store c into [s]
chku   Reax,Rds_limit
add    R4,Reax,Rds_base
st     [R4],Recx
add    Reip,Reip,2               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
add    %eax,#4                   // increment s by 4
addcc  Reax,Reax,4
add    Reip,Reip,5               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
mov    [%ebp+0x8],%eax           // store (s + 4)
add    R5,Rebp,0x8
chkl   R5,Rss_limit
chku   R5,R_FFFFFFFF
add    R6,R5,Rss_base
st     [R6],Reax
add    Reip,Reip,3               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
mov    %eax,[%ebp+0x10]          // load n
add    R7,Rebp,0x10
chkl   R7,Rss_limit
chku   R7,R_FFFFFFFF
add    R8,R7,Rss_base
ld     Reax,[R8]
add    Reip,Reip,3               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
lea    %ecx,[%eax-1]             // decrement n
sub    Recx,Reax,1
add    Reip,Reip,3               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
mov    [%ebp+0x10],%ecx          // store (n - 1)
add    R9,Rebp,0x10
chkl   R9,Rss_limit
chku   R9,R_FFFFFFFF
add    R10,R9,Rss_base
st     [R10],Recx
add    Reip,Reip,3               ; add X86 instruction length to eip in Reip
commit                           ; commits working state to official state
and    %eax,%eax                 // test n
andcc  R11,Reax,Reax
add    Reip,Reip,3
commit                           ; commits working state to official state
jg     .-0x1b                    // branch "n>0"
add    Rseq,Reip,Length(jg)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit                           ; commits working state to official state
jg     mainloop,mainloop

Host Instruction key:
commit = copy contents of working registers to official target registers and
send working stores to memory

This sample illustrates the addition of two steps to each 
set of primitive host instructions to update the official target 
registers after the execution of the host instructions neces- 
sary to carry out each primitive target instruction and to 
commit the uncommitted values in the gated store buffer 50 
to memory. As may be seen, in each case, the length of the 
target instruction is added to the value in the working 
instruction pointer register (Reip). Then a commit instruc- 
tion is executed. In one embodiment, the commit instruction 
copies the current value of each working register which is 
shadowed into its associated official target register and 
moves a pointer value designating the position of the gate of 
the gated store buffer from immediately in front of the 
uncommitted stores to immediately behind those stores so 
that they will be placed in memory. 
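
The effect of the commit instruction on working registers, official registers, and the gated store buffer can be modeled in a few lines. The sketch below is an assumption-laden illustration, not the hardware design: the class name `MorphHostState` and its methods are hypothetical, the gate is modeled as a list of pending stores rather than a pointer, and loads are assumed to see the newest uncommitted store to the same address. A `rollback` method is included to show the complementary operation of discarding uncommitted state.

```python
class MorphHostState:
    """Toy model of working/official registers and a gated store buffer."""

    def __init__(self, registers, memory):
        self.official = dict(registers)  # official target registers
        self.working = dict(registers)   # working registers that shadow them
        self.memory = dict(memory)       # committed target memory
        self.pending = []                # uncommitted stores held behind the gate

    def store(self, address, value):
        # Stores are buffered; memory is untouched until commit.
        self.pending.append((address, value))

    def load(self, address):
        # Assumption: the newest uncommitted store to the address is visible.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def commit(self):
        # Copy working registers to official registers and release the stores.
        self.official = dict(self.working)
        for address, value in self.pending:
            self.memory[address] = value
        self.pending.clear()

    def rollback(self):
        # Discard uncommitted stores and restore the working registers.
        self.working = dict(self.official)
        self.pending.clear()
```

In this model, adding the target instruction length to `working["eip"]` and then calling `commit()` corresponds to the two host instructions appended after each target instruction in the sample.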

It will be appreciated that the list of instructions illustrated
last above are all of the instructions necessary to form a host
translation of the original target assembly language instructions.
If the translation were to stop at this point, the number
of primitive host instructions would be much larger than the 
number of target instructions (probably six times as many 
instructions), and the execution could take longer than 
execution on a target processor. However, at this point, no 
reordering, optimizing, or rescheduling has yet taken place. 

If an instruction is to be run but once, it may be that the 
time required to accomplish further reordering and other 
optimization is greater than the time to execute the transla- 
tion as it exists at this point. If so, one embodiment of the 
present invention ceases the translation at this point, stores 
the translation, then executes it to determine whether excep- 
tion or errors occur. In this embodiment, steps of reordering 
and other optimization only occur if it is determined that the 
particular translation will be run a number of times or otherwise
should be optimized. This may be accomplished, for
example, by placing host instructions in each translation
which count the number of times a translation is executed 
and generate an exception (or branch) when a certain value 
is reached. The exception (or branch) transfers the operation 
to the code morphing software which then implements some 
or all of the following optimizations and any additional 
optimizations determined useful for that translation. A sec- 
ond method of determining translations being run a number 
of times and requiring optimization is to interrupt the 
execution of translations at some frequency or on some 
statistical basis and optimize any translation running at that 
time. This would ultimately provide that the instructions 
most often run would be optimized. Another solution would 
be to optimize each of certain particular types of host 
instructions such as those which create loops or are other- 
wise likely to be run most often. 
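
The first method described above, counting executions and transferring control to the optimizer at a threshold, can be sketched as follows. This is a minimal illustration under assumptions: the class `Translation`, the `threshold` default of 50, and the `optimize` callback are hypothetical names and values chosen for the example; the specification does not give a particular count.

```python
class Translation:
    """A stored translation that counts how often it has been executed."""

    def __init__(self, name, threshold=50):
        self.name = name
        self.threshold = threshold  # hypothetical trigger value
        self.count = 0
        self.optimized = False

    def execute(self, optimize):
        # The counting instructions placed in the translation.
        self.count += 1
        if not self.optimized and self.count >= self.threshold:
            # Stands in for the exception (or branch) to the
            # code morphing software.
            optimize(self)
            self.optimized = True
        # ... the translated host instructions would run here ...
```

A translation executed only a few times never pays the optimization cost, while a frequently executed one is optimized exactly once.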



Optimization

Assumes 32 bit flat address space which allows the elimination
of segment base additions and some limit checks.

Win32 uses Flat 32b segmentation
Record Assumptions:
Rss_base==0
Rss_limit==0
Rds_base==0
Rds_limit==FFFFFFFF
SS and DS protection check

mov    %ecx,[%ebp+0xc]           // load c
add    R0,Rebp,0xc
chku   R0,R_FFFFFFFF
ld     Recx,[R0]
add    Reip,Reip,3
commit
mov    %eax,[%ebp+0x8]           // load s
add    R2,Rebp,0x8
chku   R2,R_FFFFFFFF
ld     Reax,[R2]
add    Reip,Reip,3
commit
mov    [%eax],%ecx               // store c into [s]
chku   Reax,R_FFFFFFFF
st     [Reax],Recx
add    Reip,Reip,2
commit
add    %eax,#4                   // increment s by 4
addcc  Reax,Reax,4
add    Reip,Reip,5
commit
mov    [%ebp+0x8],%eax           // store (s + 4)
add    R5,Rebp,0x8
chku   R5,R_FFFFFFFF
st     [R5],Reax
add    Reip,Reip,3
commit
mov    %eax,[%ebp+0x10]          // load n
add    R7,Rebp,0x10
chku   R7,R_FFFFFFFF
ld     Reax,[R7]
add    Reip,Reip,3
commit
lea    %ecx,[%eax-1]             // decrement n
sub    Recx,Reax,1
add    Reip,Reip,3
commit
mov    [%ebp+0x10],%ecx          // store (n - 1)
add    R9,Rebp,0x10
chku   R9,R_FFFFFFFF
st     [R9],Recx
add    Reip,Reip,3
commit
and    %eax,%eax                 // test n
andcc  R11,Reax,Reax
add    Reip,Reip,3
commit
jg     .-0x1b                    // branch "n>0"
add    Rseq,Reip,Length(jg)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop

This sample illustrates a first stage of optimization which
may be practiced utilizing the present invention. This stage 
of optimization, like many of the other operations of the 
code morphing software, assumes an optimistic result. The 
particular optimization assumes that a target application 
program which has begun as a 32 bit program written for a 
flat memory model provided by the X86 family of proces- 
sors will continue as such a program. It will be noted that 
such an assumption is particular to the X86 family and 
would not necessarily be assumed with other families of 
processors being emulated. 

If this assumption is made, then in X86 applications all 
segments are mapped to the same address space. This allows 
those primitive host instructions required by the X86 seg- 
mentation process to be eliminated. As may be seen, the 
segment values are first set to zero. Then, the base for data 
is set to zero, and the limit set to the maximum available 
memory. Then, in each set of primitive host instructions for 
executing a target primitive instruction, the check for a 
segment base value and the computation of the segment base 
address required by segmentation are both eliminated. This 
reduces the loop to be executed by two host primitive 
instructions for each target primitive instruction requiring an 
addressing function. At this point, the host instruction check
for the upper memory limit still exists.

It should be noted that this optimization requires the
speculation noted that the application utilizes a 32 bit flat
memory model. If this is not true, then the error will be
discovered as the main loop resolves the destination of
control transfers and detects that the source assumptions do
not match the destination assumptions. A new translation
will then be necessary. This technique is very general and
can be applied to a variety of segmentation and other
"moded" cases where the "mode" changes infrequently, like
debug, system management mode, or "real" mode.
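
One way to picture the assumption check made when control is transferred to a translation is sketched below. This is a hedged illustration only: the dictionary layout, the function names `assumptions_hold` and `dispatch`, and the `retranslate` callback are hypothetical, and the recorded values simply repeat those listed under "Record Assumptions" in the sample.

```python
# Assumptions recorded with a translation made for the flat 32 bit model
# (values copied from the "Record Assumptions" list in the sample).
FLAT_32 = {
    "Rss_base": 0,
    "Rss_limit": 0,
    "Rds_base": 0,
    "Rds_limit": 0xFFFFFFFF,
}


def assumptions_hold(recorded, machine_state):
    """True when every recorded assumption matches the current state."""
    return all(machine_state.get(name) == value
               for name, value in recorded.items())


def dispatch(translation, machine_state, retranslate):
    """Reuse the translation if its assumptions match, else retranslate."""
    if assumptions_hold(translation["assumes"], machine_state):
        return translation
    # Source and destination assumptions differ: a new translation is needed.
    return retranslate(translation["target_address"], machine_state)
```

Because a mode such as the segmentation model changes rarely, the comparison almost always succeeds and the speculative translation is reused.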
Assume data addressed includes no bytes outside of computer
memory limits which can only occur on unaligned page crossing
memory references at the upper memory limit, and can be handled
by special case software or hardware.

mov    %ecx,[%ebp+0xc]           // load c
add    R0,Rebp,0xc
ld     Recx,[R0]
add    Reip,Reip,3
commit
mov    %eax,[%ebp+0x8]           // load s
add    R2,Rebp,0x8
ld     Reax,[R2]
add    Reip,Reip,3
commit
mov    [%eax],%ecx               // store c into [s]
st     [Reax],Recx
add    Reip,Reip,2
commit
add    %eax,#4                   // increment s by 4
addcc  Reax,Reax,4
add    Reip,Reip,5
commit
mov    [%ebp+0x8],%eax           // store (s + 4)
add    R5,Rebp,0x8
st     [R5],Reax
add    Reip,Reip,3
commit
mov    %eax,[%ebp+0x10]          // load n
add    R7,Rebp,0x10
ld     Reax,[R7]
add    Reip,Reip,3
commit
lea    %ecx,[%eax-1]             // decrement n
sub    Recx,Reax,1
add    Reip,Reip,3
commit
mov    [%ebp+0x10],%ecx          // store (n - 1)
add    R9,Rebp,0x10
st     [R9],Recx
add    Reip,Reip,3
commit
and    %eax,%eax                 // test n
andcc  R11,Reax,Reax
add    Reip,Reip,3
commit
jg     .-0x1b                    // branch "n>0"
add    Rseq,Reip,Length(jg)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop

Host Instruction key:
selcc = Select one of source registers and copy its contents to destination
register based on condition codes.

The above sample illustrates a next stage of optimization 
in which a speculative translation eliminates the upper 
memory boundary check which is only necessary for 
unaligned page crossing memory references at the top of the 
memory address space. Failure of this assumption is 
detected by either hardware or software alignment fix up. 
This reduces the translation by another host primitive 
instruction for each target primitive instruction requiring 
addressing. This optimization requires both the assumption 
noted before that the application utilizes a 32 bit flat memory 
model and the speculation that the instruction is aligned. If 
these are not both true, then the translation will fail when it 
is executed; and a new translation will be necessary. 



Detect and eliminate redundant address calculations. The example
shows the code after eliminating the redundant operations.

mov    %ecx,[%ebp+0xc]           // load c
add    R0,Rebp,0xc
ld     Recx,[R0]
add    Reip,Reip,3
commit
mov    %eax,[%ebp+0x8]           // load s
add    R2,Rebp,0x8
ld     Reax,[R2]
add    Reip,Reip,3
commit
mov    [%eax],%ecx               // store c into [s]
st     [Reax],Recx
add    Reip,Reip,2
commit
add    %eax,#4                   // increment s by 4
addcc  Reax,Reax,4
add    Reip,Reip,5
commit
mov    [%ebp+0x8],%eax           // store (s + 4)
st     [R2],Reax
add    Reip,Reip,3
commit
mov    %eax,[%ebp+0x10]          // load n
add    R7,Rebp,0x10
ld     Reax,[R7]
add    Reip,Reip,3
commit
lea    %ecx,[%eax-1]             // decrement n
sub    Recx,Reax,1
add    Reip,Reip,3
commit
mov    [%ebp+0x10],%ecx          // store (n - 1)
st     [R7],Recx
add    Reip,Reip,3
commit
and    %eax,%eax                 // test n
andcc  R11,Reax,Reax
add    Reip,Reip,3
commit
jg     .-0x1b                    // branch "n>0"
add    Rseq,Reip,Length(jg)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop

This sample illustrates a next optimization in which
common host expressions are eliminated. More particularly,
in translating the second target primitive instruction, a value
in working register Rebp (the working register representing
the stack base pointer register of an X86 processor) is added
to an offset value 0x8 and placed in a host working register
R2. It will be noted that the same operation took place in
translating target primitive instruction five in the previous
sample except that the result of the addition was placed in
working register R5. Consequently the value to be placed in
working register R5 already exists in working register R2
when host primitive instruction five is about to occur. Thus,
the host addition instruction may be eliminated from the
translation of target primitive instruction five; and the value
in working register R2 copied to working register R5.
Similarly, a host instruction adding a value in working
register Rebp to an offset value 0x10 may be eliminated in
the translation of target primitive instruction eight because
the step has already been accomplished in the translation of
target primitive instruction six and the result resides in
register R7. It should be noted that this optimization does not
depend on speculation and consequently is not subject to
failure and retranslation.
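
The elimination of common address calculations described above can be sketched as a single pass over the host instructions. The fragment below is an illustrative sketch, not the code morphing software's algorithm: the tuple encoding `(op, dest, a, b)` and the function name `eliminate_common_adds` are assumptions made for the example. It also assumes that a register substituted for an eliminated one is not overwritten afterwards, which holds for the single-use address temporaries in the samples.

```python
def eliminate_common_adds(code):
    """Drop an add whose result already sits in another register.

    code is a list of (op, dest, a, b) tuples; dest is None for stores.
    """
    available = {}  # (op, a, b) -> register currently holding that value
    alias = {}      # eliminated register -> register to read instead
    out = []
    for op, dest, a, b in code:
        # Read through aliases created by earlier eliminations.
        a, b = alias.get(a, a), alias.get(b, b)
        key = (op, a, b)
        if op == "add" and key in available:
            alias[dest] = available[key]   # reuse the earlier result
            continue
        # dest is overwritten: forget values stored in it or computed from it.
        available = {k: r for k, r in available.items()
                     if r != dest and dest not in k[1:]}
        alias.pop(dest, None)
        if op == "add" and dest not in (a, b):
            available[key] = dest
        out.append((op, dest, a, b))
    return out
```

Applied to the store of (s + 4), the second `add R5,Rebp,0x8` disappears and the store uses R2 directly, as in the sample.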



Assume that target exceptions will not occur within the translation
so delay updating eip and target state.

mov    %ecx,[%ebp+0xc]           // load c
add    R0,Rebp,0xc
ld     Recx,[R0]
mov    %eax,[%ebp+0x8]           // load s
add    R2,Rebp,0x8
ld     Reax,[R2]
mov    [%eax],%ecx               // store c into [s]
st     [Reax],Recx
add    %eax,#4                   // increment s by 4
add    Reax,Reax,4
mov    [%ebp+0x8],%eax           // store (s + 4)
st     [R2],Reax
mov    %eax,[%ebp+0x10]          // load n
add    R7,Rebp,0x10
ld     Reax,[R7]
lea    %ecx,[%eax-1]             // decrement n
sub    Recx,Reax,1
mov    [%ebp+0x10],%ecx          // store (n - 1)
st     [R7],Recx
and    %eax,%eax                 // test n
andcc  R11,Reax,Reax
jg     .-0x1b                    // branch "n>0"
add    Rseq,Reip,Length(block)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop





The above sample illustrates an optimization which 
speculates that the translation of the primitive target instruc- 
tions making up the entire translation may be accomplished 
without generating an exception. If this is true, then there is 
no need to update the official target registers or to commit 
the uncommitted stores in the store buffer at the end of each 
sequence of host primitive instructions which carries out an 
individual target primitive instruction. If the speculation 
holds true, the official target registers need only be updated 
and the stores need only be committed once, at the end of the 
sequence of target primitive instructions. This allows the 
elimination of two primitive host instructions for carrying 
out each primitive target instruction. These are replaced by 
a single host primitive instruction which updates the official 
target registers and commits the uncommitted stores to 
memory. 

As will be understood, this is another speculative opera- 
tion which is also highly likely to involve a correct specu- 
lation. This step offers a very great advantage over all prior 
art emulation techniques if the speculation holds true. It 
allows all of the primitive host instructions which carry out 
the entire sequence of target primitive instructions to be 
grouped in a sequence in which all of the individual host 
primitives may be optimized together. This has the advan- 
tage of allowing a great number of operations to be run in 
parallel on a morph host which takes advantage of the very 
long instruction word techniques. It also allows a greater 
number of other optimizations to be made because more 
choices for such optimizations exist. Once again, however, 
if the speculation proves untrue and an exception is taken 
when the loop is executed, the official target registers and 
memory hold the official target state which existed at the 
beginning of the sequence of target primitive instructions 
since a commit does not occur until the sequence of host 
instructions is actually executed. All that is necessary to 
recover from an exception is to dump the uncommitted 
stores, roll back the official registers into the working
registers, and restart translation of the target primitive
instructions at the beginning of the sequence. This
re-translation produces a translation of one target instruction
at a time, and the official state is updated after the host
sequence representing each target primitive instruction has 
been translated. This translation is then executed. When the 
exception occurs on this re-translation, correct target state is 
immediately available in the official target registers and 
memory for carrying out the exception. 



In summary:

add    R0,Rebp,0xc
ld     Recx,[R0]
add    R2,Rebp,0x8
ld     Reax,[R2]
st     [Reax],Recx
add    Reax,Reax,4
st     [R2],Reax
add    R7,Rebp,0x10
ld     Reax,[R7]                 // Live out
sub    Recx,Reax,1               // Live out
st     [R7],Recx
andcc  R11,Reax,Reax
add    Rseq,Reip,Length(block)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop





The comment "Live Out" refers to the need to actually maintain Reax and 
Recx correctly prior to the commit. Otherwise further optimization might be 
possible. 

The summary above illustrates the sequence of host 
primitive instructions which remain at this point in the
optimization process. While this example shows the main- 
tenance of the target instruction pointer (EIP) inline, it is 
possible to maintain the pointer EIP for branches out of line 
at translation time, which would remove the pointer EIP 
updating sequence from this and subsequent steps of the 
example. 



Renaming to reduce register resource dependencies. This will allow
subsequent scheduling to be more effective. From this point on, the
original target X86 code is omitted as the relationship between individual
target X86 instructions and host instructions becomes increasingly blurred.

add    R0,Rebp,0xc
ld     R1,[R0]
add    R2,Rebp,0x8
ld     R3,[R2]
st     [R3],R1
add    R4,R3,4
st     [R2],R4
add    R7,Rebp,0x10
ld     Reax,[R7]                 // Live out
sub    Recx,Reax,1               // Live out
st     [R7],Recx
andcc  R11,Reax,Reax
add    Rseq,Reip,Length(block)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop

This sample illustrates a next step of optimization, normally
called register renaming, in which operations requiring
working registers used for more than one operation in the
sequence of host primitive instructions are changed to
utilize a different unused working register to eliminate the
possibility that two host instructions will require the same
hardware. Thus, for example, the second host primitive
instruction in two samples above uses working register Recx
which represents an official target register ECX. The tenth
host primitive instruction also uses the working register
Recx. By changing the operation in the second host primitive
instruction so that the value pointed to by the address in
R0 is stored in the working register R1 rather than the
register Recx, the two host instructions do not both use the
same register. Similarly, the fourth, fifth, and sixth host
primitive instructions all utilize the working register Reax in
the earlier sample; by changing the fourth host primitive
instruction to utilize the previously unused working register
R3 instead of the working register Reax and the sixth host
primitive instruction to utilize the previously unused working
register R4 instead of the register Reax, these hardware
dependencies are eliminated.
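
The renaming step described above can be sketched as follows. This is a simplified illustration under assumptions: the tuple encoding `(op, dest, sources)`, the function name `rename_registers`, and the caller-supplied list of unused registers are hypothetical. The rule used is that the last write to a register keeps its original name (so that live-out values such as Reax and Recx end up in the registers that shadow the official target registers), while every earlier write is redirected to an unused working register.

```python
def rename_registers(code, free_regs):
    """Give every non-final write of a register a fresh, unused register.

    code is a list of (op, dest, sources); dest is None for stores.
    """
    free = iter(free_regs)
    # Index of the final write to each register in the sequence.
    last_write = {dest: i for i, (_, dest, _) in enumerate(code)
                  if dest is not None}
    current = {}  # original name -> register that currently holds its value
    out = []
    for i, (op, dest, sources) in enumerate(code):
        # Reads see whichever register holds the most recent value.
        sources = tuple(current.get(s, s) for s in sources)
        if dest is not None:
            if last_write[dest] == i:
                current[dest] = dest        # final value stays in place
            else:
                current[dest] = next(free)  # earlier value moves elsewhere
            dest = current[dest]
        out.append((op, dest, sources))
    return out


# The first eleven host instructions of the summary sample.
SUMMARY = [
    ("add", "R0", ("Rebp", 0xC)),
    ("ld", "Recx", ("R0",)),
    ("add", "R2", ("Rebp", 0x8)),
    ("ld", "Reax", ("R2",)),
    ("st", None, ("Reax", "Recx")),
    ("add", "Reax", ("Reax", 4)),
    ("st", None, ("R2", "Reax")),
    ("add", "R7", ("Rebp", 0x10)),
    ("ld", "Reax", ("R7",)),
    ("sub", "Recx", ("Reax", 1)),
    ("st", None, ("R7", "Recx")),
]
```

Calling `rename_registers(SUMMARY, ["R1", "R3", "R4"])` redirects the first load to R1 and the two early uses of Reax to R3 and R4, matching the register choices described in the text.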



After the scheduling process which organizes the primitive host operations
as multiple operations that can execute in parallel on the host VLIW
hardware. Each line shows the parallel operations that the VLIW machine
executes, and the "&" indicates the parallelism.

add    R2,Rebp,0x8         & add    R0,Rebp,0xc
nop                        & add    R7,Rebp,0x10
ld     R3,[R2]             & add    Rseq,Reip,Length(block)
ld     R1,[R0]             & add    R4,R3,4
st     [R3],R1             & ldc    Rtarg,EIP(target)
ld     Reax,[R7]           & nop
st     [R2],R4             & sub    Recx,Reax,1
st     [R7],Recx           & andcc  R11,Reax,Reax
selcc  Reip,Rseq,Rtarg     & jg     mainloop,mainloop & commit

Host Instruction key:
nop = no operation

The above sample illustrates the scheduling of host primitive
instructions for execution on the morph host. In this
example, the morph host is presumed to be a VLIW processor
which in addition to the hardware enhancements
provided for cooperating with the code morphing software
also includes, among other processing units, two arithmetic
and logic (ALU) units. The first line illustrates two individual
add instructions which have been scheduled to run
together on the morph host. As may be seen, these are the
third and the eighth primitive host instructions in the sample
just before the summary above. The second line includes a
NOP instruction (no operation but go to next instruction) and 
another add instruction. The NOP instruction illustrates that 
there are not always two instructions which can be run 
together even after some scheduling optimizing has taken 
place. In any case, this sample illustrates that only nine sets 
of primitive host instructions are left at this point to execute 
the original ten target instructions. 



Resolve host branch targets and chain stored translations

add    R2,Rebp,0x8         & add    R0,Rebp,0xc
nop                        & add    R7,Rebp,0x10
ld     R3,[R2]             & add    Rseq,Reip,Length(block)
ld     R1,[R0]             & add    R4,R3,4
st     [R3],R1             & ldc    Rtarg,EIP(target)
ld     Reax,[R7]           & nop
st     [R2],R4             & sub    Recx,Reax,1
st     [R7],Recx           & andcc  R11,Reax,Reax
selcc  Reip,Rseq,Rtarg     & jg     Sequential,Target & commit

This sample illustrates essentially the same set of host
primitive instructions except that the instructions have by
now been stored in the translation buffer 14 and executed
one or more times because the last jump (jg) instruction now
points to a jump address furnished by chaining to another
sequence of translated instructions. The chaining process 
takes the sequence of instructions out of the translator main 
loop so that translation of the sequence has been completed. 



Advanced Optimizations, Backward Code Motion:

This and subsequent examples start with the code prior to scheduling.
This optimization first depends on detecting that the code is a loop.
Then invariant operations can be moved out of the loop body and
executed once before entering the loop body.

Entry:
       add    R0,Rebp,0xc
       add    R2,Rebp,0x8
       add    R7,Rebp,0x10
       add    Rseq,Reip,Length(block)
       ldc    Rtarg,EIP(target)
Loop:
       ld     R1,[R0]
       ld     R3,[R2]
       st     [R3],R1
       add    R4,R3,4
       st     [R2],R4
       ld     Reax,[R7]
       sub    Recx,Reax,1
       st     [R7],Recx
       andcc  R11,Reax,Reax
       selcc  Reip,Rseq,Rtarg
       commit
       jg     mainloop,Loop



The above sample illustrates an advanced optimization 
step which is usually only utilized with sequences which are 
to be repeated a large number of times. The process first 
detects translations that form loops, and reviews the individual
primitive host instructions to determine which
instructions produce constant results within the loop body. 
These instructions are removed from the loop and executed 
only once to place a value in a register; from that point on, 
the value stored in the register is used rather than rerunning 
the instruction. 
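
The hoisting of loop-invariant operations described above can be sketched as follows. This is a minimal and deliberately conservative illustration, not the method of the code morphing software: the encoding `(op, dest, sources)`, the function name `hoist_invariants`, and the set `PURE_OPS` are assumptions made for the example. An instruction is hoisted only if it has no memory or condition-code side effects, its inputs are not written inside the loop (or were themselves hoisted), its destination is written exactly once, and its destination is not read earlier in the loop body.

```python
PURE_OPS = {"add", "sub", "ldc"}  # no memory access or condition codes


def hoist_invariants(loop_body):
    """Split a loop body into (entry, body) by moving invariant ops out.

    loop_body is a list of (op, dest, sources); dest is None for stores.
    """
    writes = [dest for _, dest, _ in loop_body if dest is not None]
    hoisted, read_so_far = set(), set()
    entry, body = [], []
    for op, dest, sources in loop_body:
        # Inputs must be unchanged by the loop or already hoisted.
        invariant_inputs = all(s in hoisted or s not in writes
                               for s in sources)
        if (op in PURE_OPS and invariant_inputs
                and writes.count(dest) == 1 and dest not in read_so_far):
            entry.append((op, dest, sources))
            hoisted.add(dest)
        else:
            body.append((op, dest, sources))
        read_so_far.update(sources)
    return entry, body
```

For the renamed sample, the three address-forming adds that depend only on Rebp move to the entry block, while `add R4,R3,4` stays in the loop because R3 is loaded on every iteration. This sketch does not attempt the instruction pointer handling shown in the sample.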



Schedule the loop body after backward code motion. For example
purposes, only the code in the loop body is shown scheduled.

Entry:
        add     R0,Rebp,0xc
        add     R2,Rebp,0x8
        add     R7,Rebp,0x10
        add     Rseq,Reip,Length(block)
        ldc     Rtarg,EIP(target)
Loop:
        ld      R3,[R2]             & nop
        ld      R1,[R0]             & add R4,R3,4
        st      [R3],R1             & nop
        ld      Reax,[R7]           & nop
        st      [R2],R4             & sub Recx,Reax,1
        st      [R7],Recx           & andcc R11,Reax,Reax
        selcc   Reip,Rseq,Rtarg     & jg Sequential,Loop & commit

Host Instruction key:
        ldc = load a 32-bit constant
When these non-repetitive instructions are removed from
the loop and the sequence is scheduled for execution, the 
scheduled instructions appear as in the last sample above. It 
can be seen that the initial instructions are performed but 
once during the first iteration of the loop and thereafter only 
the host primitive instructions remaining in the seven clock 
intervals shown are executed during the loop. Thus, the 
execution time has been reduced to seven instruction inter- 
vals from the ten instructions necessary to execute the 
primitive target instructions. 
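
The seven clock intervals can be checked with a small Python sketch. It encodes the scheduled loop body as bundles of operations, verifies that every register read inside the loop was produced in an earlier bundle, and counts the bundles. The encoding (the `Op` tuple, the pseudo-register `"cc"` for condition codes, and the function name `clock_intervals`) is hypothetical, and the check is deliberately simplified: it assumes operations in one bundle read values from earlier bundles, and it ignores memory ordering, loop-carried dependences, and functional unit limits.

```python
from typing import List, Tuple

# (mnemonic, registers written, registers read); "cc" stands for condition codes.
Op = Tuple[str, Tuple[str, ...], Tuple[str, ...]]

SCHEDULED_LOOP: List[List[Op]] = [
    [("ld", ("R3",), ("R2",)), ("nop", (), ())],
    [("ld", ("R1",), ("R0",)), ("add", ("R4",), ("R3",))],
    [("st", (), ("R3", "R1")), ("nop", (), ())],
    [("ld", ("Reax",), ("R7",)), ("nop", (), ())],
    [("st", (), ("R2", "R4")), ("sub", ("Recx",), ("Reax",))],
    [("st", (), ("R7", "Recx")), ("andcc", ("R11", "cc"), ("Reax",))],
    [("selcc", ("Reip",), ("Rseq", "Rtarg", "cc")),
     ("jg", (), ("cc",)),
     ("commit", (), ())],
]


def clock_intervals(bundles: List[List[Op]]) -> int:
    """Return the bundle count after checking register dependences."""
    written_in_loop = {r for bundle in bundles for _, dests, _ in bundle for r in dests}
    ready = set()  # registers already produced by an earlier bundle
    for cycle, bundle in enumerate(bundles, start=1):
        for op, _, srcs in bundle:
            for reg in srcs:
                # Registers never written in the loop come from the entry code.
                if reg in written_in_loop and reg not in ready:
                    raise ValueError(f"cycle {cycle}: {op} reads {reg} too early")
        ready.update(r for _, dests, _ in bundle for r in dests)
    return len(bundles)


print(clock_intervals(SCHEDULED_LOOP))
```

Swapping the first two bundles makes the check fail, because `add R4,R3,4` would then read R3 before the load that produces it.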
As may be seen, the steps which have been removed from
the loop are address generation steps. Thus, in the present
invention, address generation need be done only once per
loop invocation. On the other hand, the address generation
hardware of the X86 target processor must generate these
addresses each time the loop is executed. If a loop is
executed one hundred times, the present invention generates
the addresses only once while a target processor would
generate each address one hundred times.



After Backward Code Motion:
Target:
        add     R0,Rebp,0xc
        add     R2,Rebp,0x8
        add     R7,Rebp,0x10
        add     Rseq,Reip,Length(block)
        ldc     Rtarg,EIP(target)
Loop:
        ld      R1,[R0]
        ld      R3,[R2]
        st      [R3],R1
        add     R4,R3,4
        st      [R2],R4
        ld      Reax,[R7]               //Live out
        sub     Recx,Reax,1             //Live out
        st      [R7],Recx
        andcc   R11,Reax,Reax
        selcc   Reip,Rseq,Rtarg
        commit
        jg      mainloop,Loop

Register allocation: 
This shows the use of register alias detection hardware of the morph 
host that allows variables to be safely moved from memory into 
registers. The starting point is the code after "backward code motion". 
This shows the optimization that can eliminate loads. 
First the loads are performed. The address is protected by the alias 
hardware, such that should a store to the address occur, an "alias" 
exception is raised. The loads in the loop body are then replaced with 
copies. After the main body of the loop, the alias hardware is freed. 
Entry:
        add     R0,Rebp,0xc
        add     R2,Rebp,0x8
        add     R7,Rebp,0x10
        add     Rseq,Reip,Length(block)
        ldc     Rtarg,EIP(target)
        ld      Rc,[R0]                 ;First do the load of variable from memory
        prot    [R0],Alias1             ;Then protect memory location from stores
        ld      Rs,[R2]
        prot    [R2],Alias2
        ld      Rn,[R7]
        prot    [R7],Alias3
Loop:
        copy    R1,Rc
        copy    R3,Rs
        st      [R3],R1
        add     R4,Rs,4
        copy    Rs,R4
        st      [R2],Rs,NoAliasCheck
        copy    Reax,Rn                 //Live out
        sub     Recx,Reax,1             //Live out
        copy    Rn,Recx
        st      [R7],Rn,NoAliasCheck
        andcc   R11,Reax,Reax
        selcc   Reip,Rseq,Rtarg
        commit
        jg      Epilog,Loop
Epilog:
        FA      Alias1                  ;Free the alias detection hardware
        FA      Alias2                  ;Free the alias detection hardware
        FA      Alias3                  ;Free the alias detection hardware
        j       Sequential

Host Instruction key:
        protect = protect address from loads
        FA = free alias
        copy = copy
        j = jump
This sample illustrates an even more advanced optimization
which may be practiced by the microprocessor of the
present invention. Referring back to the second sample
before this sample, it will be noticed that the first three add
instructions involved computing addresses on the stack.
These addresses do not change during the execution of the
sequence of host operations. Consequently, the values stored
at these addresses may be retrieved from memory and
loaded in registers where they are immediately available for
execution. As may be seen, this is done in host primitive
instructions six, eight, and ten. In instructions seven, nine,
and eleven, each of the memory addresses is marked as
protected by special host alias hardware and the registers are
indicated as aliases for those memory addresses so that any
attempt to vary the data will cause an exception. At this
point, each of the load operations involving moving data
from these stack memory addresses becomes a simple
register-to-register copy operation which proceeds much
faster than loading from a memory address. It should be
noted that once the loop has been executed until n=0, the
protection must be removed from each of the memory
addresses so that the alias registers may be otherwise
utilized.
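
The behavior of the alias hardware described above can be modeled in a few lines of Python. This is a software sketch only; the class names `AliasHardware` and `Host`, the `no_alias_check` parameter, and the countdown helper are hypothetical stand-ins for the prot, FA, and NoAliasCheck operations in the sample, and the loop is simplified to a plain countdown of one variable.

```python
class AliasException(Exception):
    """Raised when a checked store hits a protected address."""


class AliasHardware:
    def __init__(self):
        self.slots = {}  # slot name -> protected memory address

    def prot(self, address, slot):
        """Protect an address whose value now lives in a register."""
        self.slots[slot] = address

    def free(self, slot):
        """Release a slot so it can be reused (the FA operation)."""
        self.slots.pop(slot, None)

    def check_store(self, address):
        for slot, protected in self.slots.items():
            if protected == address:
                raise AliasException(
                    f"{slot}: store to protected address {address:#x}")


class Host:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.alias = AliasHardware()

    def load(self, address):
        return self.memory[address]

    def store(self, address, value, no_alias_check=False):
        # Stores issued by the translation itself bypass the check.
        if not no_alias_check:
            self.alias.check_store(address)
        self.memory[address] = value


def run_countdown(host, n_addr):
    """Keep the loop counter in a register while its address is protected."""
    rn = host.load(n_addr)               # ld   Rn,[R7]
    host.alias.prot(n_addr, "Alias3")    # prot [R7],Alias3
    iterations = 0
    while rn > 0:
        rn -= 1                          # register arithmetic, no reload
        host.store(n_addr, rn, no_alias_check=True)
        iterations += 1
    host.alias.free("Alias3")            # FA   Alias3
    return iterations


host = Host({0x10: 3})
print(run_countdown(host, 0x10), host.memory[0x10])
```

While a slot is held, any other store to the protected address raises `AliasException`, which corresponds to the exception the text says is taken when something attempts to vary the data.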



Copy Propagation:

After using the alias hardware to turn loads within the loop body into
copies, copy propagation allows the elimination of some copies.

Entry:
        add     R0,Rebp,0xc
        add     R2,Rebp,0x8
        add     R7,Rebp,0x10
        add     Rseq,Reip,Length(block)
        ldc     Rtarg,EIP(target)
        ld      Rc,[R0]
        prot    [R0],Alias1
        ld      Rs,[R2]
        prot    [R2],Alias2
        ld      Recx,[R7]
        prot    [R7],Alias3
Loop:
        st      [Rs],Rc
        add     Rs,Rs,4
        st      [R2],Rs,NoAliasCheck
        copy    Reax,Recx               //Live out
        sub     Recx,Reax,1             //Live out
        st      [R7],Recx,NoAliasCheck
        andcc   R11,Reax,Reax
        selcc   Reip,Rseq,Rtarg
        commit
        jg      Epilog,Loop
Epilog:
        FA      Alias1
        FA      Alias2
        FA      Alias3
        j       Sequential




This sample illustrates the next stage of optimization in
which it is recognized that most of the copy instructions
which replaced the load instructions in the optimization
illustrated in the last sample are unnecessary and may be
eliminated. That is, if a register-to-register copy operation
takes place, then the data existed before the operation in the
register from which the data was copied. If so, the data can
be accessed in the first register rather than the register to
which it is being copied and the copy operation eliminated.
As may be seen, this eliminates the first, second, fifth, and
ninth primitive host instructions shown in the loop of the last
sample. In addition, the registers used in others of the host
primitive instructions are also changed to reflect the correct
registers for the data. Thus, for example, when the first and
second copy instructions are eliminated, the third store
instruction must copy the data from the working register Rc
where it exists (rather than register R1) and place the data at
the address indicated in working register Rs where the
address exists (rather than register R3).
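
The forward part of this transformation can be sketched in Python as below. It is a simplified illustration, not the patent's procedure: the instruction tuples, the function name `propagate_copies`, and the `live_out` set are assumptions made for the example. The sketch rewrites each use to read from the register the value was copied from, then deletes copies whose destination is never read again. That is enough to remove the first and second copies and to rewrite the store as `st [Rs],Rc`. Removing the fifth and ninth copies as in the sample also requires renaming destination registers, which is not modeled here.

```python
from typing import List, Optional, Set, Tuple

Instr = Tuple[str, Optional[str], Tuple[str, ...]]  # (op, dest, sources)


def propagate_copies(body: List[Instr], live_out: Set[str]) -> List[Instr]:
    # Pass 1 (forward): read each value from the register it was copied from.
    source_of = {}
    rewritten = []
    for op, dest, srcs in body:
        srcs = tuple(source_of.get(s, s) for s in srcs)
        if dest is not None:
            # dest receives a new value, so facts that mention it are stale.
            source_of = {d: s for d, s in source_of.items()
                         if dest not in (d, s)}
        if op == "copy" and srcs[0] != dest:
            source_of[dest] = srcs[0]
        rewritten.append((op, dest, srcs))

    # Pass 2 (backward): drop copies whose destination is never read again.
    live = set(live_out)
    kept = []
    for op, dest, srcs in reversed(rewritten):
        if op == "copy" and dest not in live:
            continue
        live.discard(dest)
        live.update(srcs)
        kept.append((op, dest, srcs))
    kept.reverse()
    return kept


# Loop body from the register allocation sample (control flow omitted).
BODY: List[Instr] = [
    ("copy", "R1", ("Rc",)),
    ("copy", "R3", ("Rs",)),
    ("st", None, ("R3", "R1")),
    ("add", "R4", ("Rs", "4")),
    ("copy", "Rs", ("R4",)),
    ("st", None, ("R2", "Rs")),
    ("copy", "Reax", ("Rn",)),
    ("sub", "Recx", ("Reax", "1")),
    ("copy", "Rn", ("Recx",)),
    ("st", None, ("R7", "Rn")),
]

# Assumption: registers carried around the loop or needed after it.
LIVE_OUT = {"Rs", "Rn", "Reax", "Recx"}

for instr in propagate_copies(BODY, LIVE_OUT):
    print(instr)
```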
Example illustrating scheduling of the loop body only.

Entry:
        add     R0,Rebp,0xc
        add     R2,Rebp,0x8
        add     R7,Rebp,0x10
        add     Rseq,Reip,Length(block)
        ldc     Rtarg,EIP(target)
        ld      Rc,[R0]
        prot    [R0],Alias1
        ld      Rs,[R2]
        prot    [R2],Alias2
        ld      Recx,[R7]
        prot    [R7],Alias3
Loop:
        st      [Rs],Rc             & add Rs,Rs,4 & copy Reax,Recx
        st      [R2],Rs,NAC         & sub Recx,Reax,1
        st      [R7],Recx,NAC       & andcc R11,Reax,Reax
        selcc   Reip,Rseq,Rtarg     & jg Epilog,Loop & commit
Epilog:
        FA      Alias1
        FA      Alias2
        FA      Alias3
        j       Sequential

Host Instruction key:
        NAC = No Alias Check
The scheduled host instructions are illustrated in the
sample above. It will be noted that the sequence is such that
fewer clocks are required to execute the loop than to execute
the primitive target instructions originally decoded from the
source code. Thus, apart from all of the other acceleration
accomplished, the total number of combined operations to
be run is simply less than the operations necessary to execute
the original target code.
Store Elimination by use of the alias hardware.

Entry:
        add     R0,Rebp,0xc
        add     R2,Rebp,0x8
        add     R7,Rebp,0x10
        add     Rseq,Reip,Length(block)
        ldc     Rtarg,EIP(target)
        ld      Rc,[R0]
        prot    [R0],Alias1         ;protect address from loads and stores
        ld      Rs,[R2]
        prot    [R2],Alias2         ;protect address from loads and stores
        ld      Recx,[R7]
        prot    [R7],Alias3         ;protect address from loads and stores
Loop:
        st      [Rs],Rc             & add Rs,Rs,4 & copy Reax,Recx
        sub     Recx,Reax,1         & andcc R11,Reax,Reax
        selcc   Reip,Rseq,Rtarg     & jg Epilog,Loop & commit
Epilog:
        FA      Alias1
        FA      Alias2
        FA      Alias3
        st      [R2],Rs             ;writeback final value of Rs
        st      [R7],Recx           ;writeback final value of Recx
        j       Sequential

The final optimization shown in this sample is the use of
the alias hardware to eliminate stores. This eliminates the
stores from within the loop body, and performs them only in
the loop epilog. This reduces the number of host instructions
within the loop body to three compared to the original ten
target instructions.

Although the present invention has been described in
terms of a preferred embodiment, it will be appreciated that
various modifications and alterations might be made by
those skilled in the art without departing from the spirit and
scope of the invention. For example, although the invention
has been described with relation to the emulation of X86
processors, it should be understood that the invention
applies just as well to programs designed for other processor
architectures, and programs that execute on virtual
machines, such as P code, Postscript, or Java programs. The
invention should therefore be measured in terms of the
claims which follow.

What is claimed is:

1. A microprocessor comprising the combination of
translation software, and

host hardware,

the translation software running directly on the host
hardware,

the translation software responding to target instructions
by generating instructions to run on the host hardware.

2. A microprocessor as claimed in claim 1 in which the
translation software is code morphing software and the host
hardware is morph host hardware.

3. A microprocessor as claimed in claim 2 in which the
enhanced morph host hardware comprises a very long
instruction word microprocessor.

4. A microprocessor comprising the combination of
translation software, and

host hardware,

in which the translation software is code morphing
software and the host hardware is morph host
hardware,

in which the code morphing software comprises:

processes to translate target instructions of a program
written for a processor having a first instruction set
into primitive instructions capable of execution on
the enhanced morph host hardware, and

processes to store the host primitive instructions as
host translations in a translation buffer from which
they may be recalled and executed by the morph
host hardware any number of times.

5. A microprocessor as claimed in claim 4 in which the 
code morphing software further comprises processes to 
reorder, reschedule and optimize the primitive host instruc- 
tions. 

6. A microprocessor comprising the combination of
translation software,

in which the translation software is code morphing
software, and host hardware,

in which the host hardware is morph host hardware, and

in which the enhanced morph host hardware comprises:

a store buffer for data being transferred to memory, the
store buffer including:

means responding to execution of a host translation
without an exception to commit data stored in the
store buffer to memory, and

means responding to generation of an exception or
error during execution of a host translation to
dump data in the store buffer without committing
it to memory;

an execution unit comprising:

a set of working registers in the execution unit larger
than a set of registers required by a target processor
in a target processor execution unit;

a set of target registers for holding official register
state of a target processor developed in processing
a target program; and

in which the code morphing software comprises:

means responding to execution of a host translation
without an exception or error during execution of
a host translation for transferring state from the
sets of working registers to the target registers, and

means responding to generation of an exception or
error during execution of a host translation for
transferring state from the sets of target registers
to the sets of working registers.

7. A microprocessor as claimed in claim 6:

in which the means responding to execution of a host
translation without an exception or error during
execution of a host translation for transferring state from the
sets of working registers to the target registers
comprises a commit host instruction,

in which the means responding to generation of an
exception or error during execution of a host translation
for transferring state from the sets of target registers to
the sets of working registers comprises a rollback host
instruction, and

in which the means responding to execution of a host
translation without an exception to commit data stored
in the store buffer to memory, and the means responding
to generation of an exception or error during
execution of a host translation to dump data in the store
buffer without committing it to memory function in
response to the commit host instruction and the
rollback host instruction.

8. A microprocessor as claimed in claim 6 in which the
enhanced morph host hardware comprises means indicating
that a target instruction has been translated.

9. A microprocessor as claimed in claim 6 in which the
means indicating that a target instruction has been translated
includes an address translation cache storing an indication
that a target instruction has been translated.

10. A microprocessor as claimed in claim 9 in which the
means indicating that an instruction is normal or abnormal
includes an address translation cache storing an indication
that a target instruction has been translated.

11. A microprocessor as claimed in claim 6 in which the
enhanced morph host hardware comprises means indicating
that an instruction is normal or abnormal.

12. A microprocessor as claimed in claim 6 in which the
enhanced morph host hardware comprises means indicating
that data at a memory address has been moved to a register.

13. A microprocessor as claimed in claim 12 in which the
means indicating that data at a memory address has been
moved to a register includes means storing an indication that
data at the address has been moved to a register comprises
means for comparing addresses to be executed with an
address of data which has been moved to a register.

14. A microprocessor for a host computer designed to
execute target application programs for a target computer
having a target instruction set comprising the combination
of

code morphing software, and morph host processing
hardware designed to execute instructions of a host
instruction set,

the combination of the code morphing software and the
morph host processing hardware comprising:

means to translate a set of target instructions into
instructions of a host instruction set speculating upon
the occurrence of a condition,

means to determine under control of the code morphing
software official state of the target computer which
existed at the beginning of a translation of a set of
target instructions during execution of the target
program by the microprocessor,

means for updating state of the target computer from
state of the host computer when a set of host
instructions executes in accordance with the speculation,

means to detect failure of the condition during the
execution of the set of host instructions,

means for updating state of the host computer from
state of the target computer when a set of host
instructions fails to execute in accordance with the
speculation, and

means to translate a new set of host instructions without
the speculation when a set of host instructions fails
to execute in accordance with the speculation.

15. A microprocessor for a host computer as claimed in
claim 14 further comprising means to optimize the
instructions of the host instruction set translated from the target
program speculating upon the occurrence of a condition.

16. A microprocessor for a host computer as claimed in
claim [illegible] in which the means to translate a set of target
instructions into instructions of a host instruction set
speculating upon the occurrence of a condition speculates that no
exception or error will occur during the translation of the set
of instructions.

17. A microprocessor for a host computer as claimed in
claim 14 in which the means to determine under control of
the code morphing software official state of the target
computer which existed at the beginning of a translation of
a set of target instructions comprises means storing official
state of the target computer which existed at the beginning
of a translation of a set of target instructions.

18. A microprocessor for a host computer as claimed in
claim [illegible] in which the means storing official state of the
target computer which existed at the beginning of a
translation of a set of target instructions comprises a set of target
registers.

19. A microprocessor for a host computer as claimed in
claim [illegible] in which the means for updating state of the target
computer from state of the host computer when a set of host
instructions executes in accordance with the speculation
comprises means for transferring host state to the means
storing official state of the target computer when a set of host
instructions executes without an error or exception.

20. A microprocessor for a host computer as claimed in
claim 14 in which the means to detect failure of the
condition during the execution of the set of host instructions
comprises means generating an exception.

21. A microprocessor for a host computer as claimed in
claim 20 in which the means generating an exception is
hardware means.

22. A microprocessor for a host computer as claimed in
claim 20 in which the means generating an exception is
software means.

23. A microprocessor for a host computer as claimed in
claim [illegible] in which the means for updating state of the host
computer from state of the target computer when a set of
host instructions fails to execute in accordance with the
speculation comprises means transferring official state of the
target computer which existed at the beginning of a
translation of a set of target instructions to update host state.

24. A microprocessor for a host computer as claimed in
claim 14 in which the means to translate a new set of host
instructions without the speculation when a set of host
instructions fails to execute in accordance with the
speculation comprises means beginning at a target instruction
which existed at the beginning of a translation of a set of
target instructions which failed to translate each target
instruction in order saving host state as official state of the
target computer.

25. A method of executing target programs designed to be
executed by a target computer having a target instruction set
on a host computer having a host processor capable of
executing instructions from a host instruction set different
than the target instruction set, the method comprising:

storing state of the target computer as it exists at the
beginning of translating a target instruction;

translating target instructions commanding an operation
into a set of host instructions for executing on the host
processor the operation commanded by the target
instruction;

storing the host instructions as a host translation in a
translation buffer;

executing the host translation on the host processor;

updating state stored for the target computer from state of
the host computer when the execution of a host
translation does not generate an exception or error; and

updating state of the host computer from stored state of
the target computer when the execution of the host
translation generates an exception or error.

26. A method of executing target application programs as
claimed in claim 25 comprising the additional step of
optimizing and rescheduling host instructions before storing
the set of host instructions as a host translation.

27. A method of executing target application programs as 
claimed in claim 25 comprising the additional steps of: 

translating each target instructions commanding an opera- 
tion into a set of host instructions for executing on the 
host processor each target instruction of the operation 
commanded by the target instructions without 
reordering, optimizing, or rescheduling the primitive 
instructions to generate a host instruction; 

storing each set of host instructions in the translation 
buffer as a host translation as the set is completed; 

storing and updating state of the host computer after each 
target instruction is translated to a set of host 
instructions, 

executing each host translation on the host processor; 

updating state stored for the target computer from state of 
the host computer when the execution of a host trans- 
lation does not generate an exception or error; and 

taking any exception or error generated when executing 
the host instruction generates an exception or an error. 

28. A method of executing target application programs as 
claimed in claim 25 comprising the additional step of 
optimizing and rescheduling host instructions after storing 
the set of host instructions as a host translation if the set of 
host instructions is to be reexecuted often. 